2025-05-01-15-00

Reinforced MLLM: A Survey on RL-Based Reasoning in Multimodal Large Language Models

Abstract

arXiv:2504.21277v1 Announce Type: new Abstract: The integration of reinforcement learning (RL) into the reasoning capabilities of Multimodal Large Language Models (MLLMs) has rapidly emerged as a transformative research direction. While MLLMs significantly extend Large Language Models (LLMs) to handle diverse modalities such as vision, audio, and video, enabling robust reasoning across multimodal inputs remains a major challenge. This survey systematically reviews recent advances in RL-based reasoning for MLLMs, covering key algorithmic designs, reward mechanism innovations, and practical applications. We highlight two main RL paradigms--value-free and value-based methods--and analyze how RL enhances reasoning abilities by optimizing reasoning trajectories and aligning multimodal information. Furthermore, we provide an extensive overview of benchmark datasets, evaluation protocols, and existing limitations, and propose future research directions to address current bottlenecks such as sparse rewards, inefficient cross-modal reasoning, and real-world deployment constraints. Our goal is to offer a comprehensive and structured guide to researchers interested in advancing RL-based reasoning in the multimodal era.

摘要

将强化学习(RL)融入多模态大语言模型(MLLMs)的推理能力,已迅速成为一个变革性的研究方向。尽管MLLMs显著扩展了大语言模型(LLMs)处理视觉、音频和视频等多种模态的能力,但实现跨模态输入的稳健推理仍面临重大挑战。本文系统综述了基于RL的MLLMs推理的最新进展,涵盖关键算法设计、奖励机制创新及实际应用。我们重点分析了两种主要RL范式——无价值函数与基于价值函数的方法,并阐释了RL如何通过优化推理轨迹和对齐多模态信息来增强推理能力。此外,我们全面梳理了基准数据集、评估协议及现有局限性,并针对稀疏奖励、低效跨模态推理和现实部署约束等当前瓶颈问题,提出了未来研究方向。本研究旨在为推进多模态时代基于RL的推理研究提供全面而结构化的指南。


On the Potential of Large Language Models to Solve Semantics-Aware Process Mining Tasks

Abstract

arXiv:2504.21074v1 Announce Type: new Abstract: Large language models (LLMs) have been shown to be valuable tools for tackling process mining tasks. Existing studies report on their capability to support various data-driven process analyses and even, to some extent, their ability to reason about how processes work. This reasoning ability suggests that there is potential for LLMs to tackle semantics-aware process mining tasks, which are tasks that rely on an understanding of the meaning of activities and their relationships. Examples include process discovery, where the meaning of activities can indicate their dependencies, and anomaly detection, where the meaning can be used to recognize abnormal process behavior. In this paper, we systematically explore the capabilities of LLMs for such tasks. Unlike prior work, which largely evaluates LLMs in their default state, we investigate their utility through both in-context learning and supervised fine-tuning. Concretely, we define five process mining tasks requiring semantic understanding and provide extensive benchmarking datasets for evaluation. Our experiments reveal that while LLMs struggle with challenging process mining tasks when used out of the box or with minimal in-context examples, they achieve strong performance when fine-tuned for these tasks across a broad range of process types and industries.

摘要

大型语言模型(LLMs)已被证明是解决流程挖掘任务的有力工具。现有研究证实其能够支持多种数据驱动的流程分析,甚至在一定程度上具备对流程运作原理的推理能力。这种推理能力表明LLMs具备处理语义感知流程挖掘任务的潜力,这类任务依赖于对活动含义及其关系的理解。例如在流程发现中,活动含义可指示其依赖关系;而在异常检测中,语义信息可用于识别异常流程行为。本文系统性地探索了LLMs在此类任务中的能力。与主要评估默认状态下LLMs的先前研究不同,我们通过上下文学习和监督微调两种方式考察其实用性。具体而言,我们定义了五项需要语义理解的流程挖掘任务,并提供大量基准数据集用于评估。实验表明:虽然LLMs在直接使用或仅提供少量上下文示例时难以应对具有挑战性的流程挖掘任务,但经过针对不同流程类型和行业的任务微调后,其表现显著提升。


Theoretical Foundations for Semantic Cognition in Artificial Intelligence

Abstract

arXiv:2504.21218v1 Announce Type: new Abstract: This monograph presents a modular cognitive architecture for artificial intelligence grounded in the formal modeling of belief as structured semantic state. Belief states are defined as dynamic ensembles of linguistic expressions embedded within a navigable manifold, where operators enable assimilation, abstraction, nullification, memory, and introspection. Drawing from philosophy, cognitive science, and neuroscience, we develop a layered framework that enables self-regulating epistemic agents capable of reflective, goal-directed thought. At the core of this framework is the epistemic vacuum: a class of semantically inert cognitive states that serves as the conceptual origin of belief space. From this foundation, the Null Tower arises as a generative structure recursively built through internal representational capacities. The theoretical constructs are designed to be implementable in both symbolic and neural systems, including large language models, hybrid agents, and adaptive memory architectures. This work offers a foundational substrate for constructing agents that reason, remember, and regulate their beliefs in structured, interpretable ways.

摘要

本专著提出了一种基于信念作为结构化语义状态形式化建模的模块化人工智能认知架构。信念状态被定义为嵌入可导航流形中的语言表达动态集合,其中操作符支持同化、抽象、消解、记忆和内省等功能。通过整合哲学、认知科学与神经科学的研究成果,我们构建了一个分层框架,使具备自我调节能力的认知主体能够进行反思性和目标导向的思维。该框架的核心是认知真空——一类语义惰性的认知状态,作为信念空间的概念起源。在此基础上,零塔结构通过内部表征能力的递归构建而生成。这些理论构造设计适用于符号系统和神经系统实现,包括大语言模型、混合智能体和自适应记忆架构。本研究为构建具有结构化、可解释性的推理、记忆和信念调节能力的智能体提供了基础性框架。


Birdie: Natural Language-Driven Table Discovery Using Differentiable Search Index

Abstract

arXiv:2504.21282v1 Announce Type: new Abstract: Natural language (NL)-driven table discovery identifies relevant tables from large table repositories based on NL queries. While current deep-learning-based methods using the traditional dense vector search pipeline, i.e., representation-index-search, achieve remarkable accuracy, they face several limitations that impede further performance improvements: (i) the errors accumulated during the table representation and indexing phases affect the subsequent search accuracy; and (ii) insufficient query-table interaction hinders effective semantic alignment, impeding accuracy improvements. In this paper, we propose a novel framework Birdie, using a differentiable search index. It unifies the indexing and search into a single encoder-decoder language model, thus getting rid of error accumulations. Birdie first assigns each table a prefix-aware identifier and leverages a large language model-based query generator to create synthetic queries for each table. It then encodes the mapping between synthetic queries/tables and their corresponding table identifiers into the parameters of an encoder-decoder language model, enabling deep query-table interactions. During search, the trained model directly generates table identifiers for a given query. To accommodate the continual indexing of dynamic tables, we introduce an index update strategy via parameter isolation, which mitigates the issue of catastrophic forgetting. Extensive experiments demonstrate that Birdie outperforms state-of-the-art dense methods by 16.8% in accuracy, and reduces forgetting by over 90% compared to other continual learning approaches.

摘要

基于自然语言(NL)驱动的表格发现技术通过NL查询从大规模表格库中识别相关表格。尽管当前基于深度学习的传统稠密向量检索流程(即表示-索引-搜索)方法取得了显著精度,但仍存在限制性能进一步提升的若干问题:(i)表格表示和索引阶段积累的误差会影响后续搜索精度;(ii)查询-表格交互不足阻碍了有效的语义对齐,制约精度提升。本文提出新型框架Birdie,采用可微分搜索索引技术,将索引与搜索统一整合至单个编码器-解码器语言模型中,从而消除误差累积。Birdie首先为每个表格分配前缀感知标识符,并利用基于大语言模型的查询生成器为每个表格创建合成查询;随后将合成查询/表格与其对应表格标识符的映射关系编码至编码器-解码器语言模型的参数中,实现深度查询-表格交互。搜索阶段,训练后的模型直接为给定查询生成表格标识符。为适应动态表格的持续索引需求,我们通过参数隔离引入索引更新策略,显著缓解灾难性遗忘问题。大量实验表明,Birdie在准确率上超越最先进稠密方法16.8%,相比其他持续学习方法减少90%以上的遗忘率。


Phi-4-reasoning Technical Report

Abstract

arXiv:2504.21318v1 Announce Type: new Abstract: We introduce Phi-4-reasoning, a 14-billion parameter reasoning model that achieves strong performance on complex reasoning tasks. Trained via supervised fine-tuning of Phi-4 on a carefully curated set of "teachable" prompts (selected for the right level of complexity and diversity) and reasoning demonstrations generated using o3-mini, Phi-4-reasoning generates detailed reasoning chains that effectively leverage inference-time compute. We further develop Phi-4-reasoning-plus, a variant enhanced through a short phase of outcome-based reinforcement learning that offers higher performance by generating longer reasoning traces. Across a wide range of reasoning tasks, both models outperform significantly larger open-weight models such as the DeepSeek-R1-Distill-Llama-70B model and approach the performance levels of the full DeepSeek-R1 model. Our comprehensive evaluations span benchmarks in math and scientific reasoning, coding, algorithmic problem solving, planning, and spatial understanding. Interestingly, we observe a non-trivial transfer of improvements to general-purpose benchmarks as well. In this report, we provide insights into our training data, our training methodologies, and our evaluations. We show that the benefit of careful data curation for supervised fine-tuning (SFT) extends to reasoning language models, and can be further amplified by reinforcement learning (RL). Finally, our evaluation points to opportunities for improving how we assess the performance and robustness of reasoning models.

摘要

我们推出Phi-4-reasoning——一个140亿参数的推理模型,该模型在复杂推理任务中表现出色。该模型通过对Phi-4进行监督微调训练而成,训练数据包括精心筛选的具有适当复杂度与多样性的"可教学"提示集,以及使用o3-mini生成的推理演示。Phi-4-reasoning能生成充分利用推理时计算资源的详细推理链。我们还开发了增强版Phi-4-reasoning-plus,该变体通过短期基于结果的强化学习进一步提升了性能,可生成更长的推理轨迹。在各类推理任务中,这两个模型的性能显著优于DeepSeek-R1-Distill-Llama-70B等更大规模的开源权重模型,并接近完整版DeepSeek-R1模型的水平。我们的综合评估涵盖数学与科学推理、编程、算法问题求解、规划及空间理解等基准测试。值得注意的是,我们还观察到模型在通用基准测试上也获得了显著提升。本报告详细阐述了训练数据构成、训练方法及评估过程。研究表明,监督微调(SFT)中精细数据筛选的优势同样适用于推理语言模型,且可通过强化学习(RL)进一步放大。最后,我们的评估指出了当前推理模型性能与鲁棒性评估方法的改进空间。


Galvatron: An Automatic Distributed System for Efficient Foundation Model Training

Abstract

arXiv:2504.21411v1 Announce Type: new Abstract: Galvatron is a distributed system for efficiently training large-scale Foundation Models. It overcomes the complexities of selecting optimal parallelism strategies by automatically identifying the most efficient hybrid strategy, incorporating data, tensor, pipeline, sharded data, and sequence parallelism, along with recomputation. The system's architecture includes a profiler for hardware and model analysis, a search engine for strategy optimization using decision trees and dynamic programming, and a runtime for executing these strategies efficiently. Benchmarking on various clusters demonstrates Galvatron's superior throughput compared to existing frameworks. This open-source system offers user-friendly interfaces and comprehensive documentation, making complex distributed training accessible and efficient. The source code of Galvatron is available at https://github.com/PKU-DAIR/Hetu-Galvatron.

摘要

Galvatron是一个用于高效训练大规模基础模型的分布式系统。该系统通过自动识别最优混合并行策略(包含数据并行、张量并行、流水线并行、分片数据并行、序列并行以及重计算技术),克服了人工选择并行策略的复杂性。系统架构包含三个核心组件:用于硬件与模型分析的性能分析器、基于决策树与动态规划的策略优化搜索引擎,以及高效执行策略的运行时系统。在不同集群上的基准测试表明,Galvatron的吞吐量显著优于现有框架。这一开源系统提供用户友好接口与完整文档,使复杂分布式训练变得高效易用。Galvatron源代码已发布于https://github.com/PKU-DAIR/Hetu-Galvatron。


ShorterBetter: Guiding Reasoning Models to Find Optimal Inference Length for Efficient Reasoning

Abstract

arXiv:2504.21370v1 Announce Type: new Abstract: Reasoning models such as OpenAI o3 and DeepSeek-R1 have demonstrated strong performance on reasoning-intensive tasks through extended Chain-of-Thought (CoT) prompting. While longer reasoning traces can facilitate a more thorough exploration of solution paths for complex problems, researchers have observed that these models often "overthink", leading to inefficient inference. In this paper, we introduce ShorterBetter, a simple yet effective reinforcement learning method that enables reasoning language models to discover their own optimal CoT lengths without human intervention. By sampling multiple outputs per problem and defining the Sample Optimal Length (SOL) as the length of the shortest correct response among all the outputs, our method dynamically guides the model toward optimal inference lengths. Applied to the DeepSeek-Distill-Qwen-1.5B model, ShorterBetter achieves up to an 80% reduction in output length on both in-domain and out-of-domain reasoning tasks while maintaining accuracy. Our analysis shows that overly long reasoning traces often reflect a loss of reasoning direction, which suggests that the extended CoT produced by reasoning models is highly compressible.

摘要

OpenAI o3和DeepSeek-R1等推理模型通过扩展的思维链(CoT)提示,在推理密集型任务中展现出强劲性能。虽然更长的推理轨迹有助于对复杂问题进行更彻底的求解路径探索,但研究者发现这些模型常出现"过度思考"现象,导致推理效率低下。本文提出ShorterBetter方法——一种简单而有效的强化学习策略,能使推理语言模型无需人工干预即可自主发现其最优CoT长度。该方法通过为每个问题采样多个输出,并将样本最优长度(SOL)定义为所有输出中最短的正确响应,动态引导模型趋向最优推理长度。在DeepSeek-Distill-Qwen-1.5B模型上的应用表明,ShorterBetter在保持准确率的同时,对领域内和领域外推理任务均实现了最高80%的输出长度缩减。分析显示,过长的推理轨迹往往反映推理方向的迷失,这表明推理模型生成的扩展CoT具有高度可压缩性。
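
The Sample Optimal Length described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. The abstract defines SOL only as the shortest correct response among the sampled outputs. The reward shape, the `alpha` and `beta` weights, and the handling of the case where no sample is correct are assumptions made here for illustration.

```python
from typing import List, Optional, Tuple


def sample_optimal_length(samples: List[Tuple[int, bool]]) -> Optional[int]:
    """Return the length of the shortest correct sample, or None if no sample is correct.

    Each sample is a (length_in_tokens, is_correct) pair.
    """
    correct_lengths = [length for length, is_correct in samples if is_correct]
    return min(correct_lengths) if correct_lengths else None


def length_aware_reward(length: int, is_correct: bool, sol: Optional[int],
                        alpha: float = 1.0, beta: float = 0.001) -> float:
    """Hypothetical reward: credit for correctness minus a penalty for distance from SOL."""
    correctness = alpha if is_correct else 0.0
    if sol is None:
        # Assumption: with no correct sample there is no length target to pull toward.
        return correctness
    return correctness - beta * abs(length - sol)


# Four sampled responses for one problem: (length, is_correct).
samples = [(1200, True), (450, True), (300, False), (2000, True)]
sol = sample_optimal_length(samples)
rewards = [length_aware_reward(n, ok, sol) for n, ok in samples]
```

Under this sketch the 450-token correct response receives the highest reward, while the shorter but incorrect 300-token response does not, which matches the idea of pulling the model toward the shortest answer that is still correct.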


MF-LLM: Simulating Collective Decision Dynamics via a Mean-Field Large Language Model Framework

Abstract

arXiv:2504.21582v1 Announce Type: new Abstract: Simulating collective decision-making involves more than aggregating individual behaviors; it arises from dynamic interactions among individuals. While large language models (LLMs) show promise for social simulation, existing approaches often exhibit deviations from real-world data. To address this gap, we propose the Mean-Field LLM (MF-LLM) framework, which explicitly models the feedback loop between micro-level decisions and macro-level population. MF-LLM alternates between two models: a policy model that generates individual actions based on personal states and group-level information, and a mean field model that updates the population distribution from the latest individual decisions. Together, they produce rollouts that simulate the evolving trajectories of collective decision-making. To better match real-world data, we introduce IB-Tune, a fine-tuning method for LLMs grounded in the information bottleneck principle, which maximizes the relevance of population distributions to future actions while minimizing redundancy with historical data. We evaluate MF-LLM on a real-world social dataset, where it reduces KL divergence to human population distributions by 47 percent over non-mean-field baselines, and enables accurate trend forecasting and intervention planning. It generalizes across seven domains and four LLM backbones, providing a scalable foundation for high-fidelity social simulation.

摘要

模拟集体决策不仅涉及个体行为的聚合,更源于个体间的动态交互。尽管大语言模型(LLMs)在社会模拟中展现出潜力,现有方法常与现实数据存在偏差。为弥合这一差距,我们提出平均场大语言模型(MF-LLM)框架,该框架显式建模微观决策与宏观群体间的反馈循环。MF-LLM交替运行两个模型:基于个体状态和群体信息生成个人行为的策略模型,以及根据最新个体决策更新群体分布的平均场模型。二者协同产生模拟集体决策演化轨迹的推演结果。为更好匹配现实数据,我们提出基于信息瓶颈原理的微调方法IB-Tune,其在最大化群体分布与未来行动相关性的同时,最小化与历史数据的冗余。我们在真实社会数据集上评估MF-LLM,其相较于非平均场基线方法将人类群体分布的KL散度降低47%,并能实现精准趋势预测与干预规划。该框架在七个领域和四种LLM骨干模型中均展现泛化能力,为高保真社会模拟提供了可扩展的基础。
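
The alternation between the policy model and the mean field model can be illustrated with a toy rollout. In this sketch both models are replaced by simple deterministic stand-ins rather than LLMs, and the two-action setting, the blending rule, and the `biases` values are all hypothetical. Only the loop structure (individual decisions, then a population update, repeated) comes from the abstract.

```python
from typing import Dict, List


def policy_model(personal_bias: float, mean_field: Dict[str, float]) -> str:
    """Stand-in for the policy LLM: pick an action from a personal state and the group signal."""
    # Hypothetical rule: blend the agent's own leaning with the current share of supporters.
    score = 0.5 * personal_bias + 0.5 * mean_field["support"]
    return "support" if score >= 0.5 else "oppose"


def mean_field_model(actions: List[str]) -> Dict[str, float]:
    """Stand-in for the mean field LLM: summarise the latest decisions as a distribution."""
    support = actions.count("support") / len(actions)
    return {"support": support, "oppose": 1.0 - support}


def rollout(biases: List[float], steps: int) -> List[Dict[str, float]]:
    """Alternate micro-level decisions and macro-level updates, recording each distribution."""
    mean_field = {"support": 0.5, "oppose": 0.5}  # uninformative starting distribution
    trajectory = [mean_field]
    for _ in range(steps):
        actions = [policy_model(b, mean_field) for b in biases]  # micro-level decisions
        mean_field = mean_field_model(actions)                   # macro-level update
        trajectory.append(mean_field)
    return trajectory


trajectory = rollout(biases=[0.9, 0.8, 0.6, 0.3, 0.1], steps=3)
```

Each entry of `trajectory` is the population distribution after one round, so the list traces how the collective state evolves as individuals react to it.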


AdaR1: From Long-CoT to Hybrid-CoT via Bi-Level Adaptive Reasoning Optimization

Abstract

arXiv:2504.21659v1 Announce Type: new Abstract: Recently, long-thought reasoning models have achieved strong performance on complex reasoning tasks, but they often incur substantial inference overhead, making efficiency a critical concern. Our empirical analysis reveals that the benefit of using Long-CoT varies across problems: while some problems require elaborate reasoning, others show no improvement, or even degraded accuracy. This motivates adaptive reasoning strategies that tailor reasoning depth to the input. However, prior work primarily reduces redundancy within long reasoning paths, limiting exploration of more efficient strategies beyond the Long-CoT paradigm. To address this, we propose a novel two-stage framework for adaptive and efficient reasoning. First, we construct a hybrid reasoning model by merging long and short CoT models to enable diverse reasoning styles. Second, we apply bi-level preference training to guide the model to select suitable reasoning styles (group-level), and to prefer concise and correct reasoning within each style group (instance-level). Experiments demonstrate that our method significantly reduces inference costs compared to other baseline approaches, while maintaining performance. Notably, on five mathematical datasets, the average length of reasoning is reduced by more than 50%, highlighting the potential of adaptive strategies to optimize reasoning efficiency in large language models. Our code is coming soon at https://github.com/StarDewXXX/AdaR1

摘要

近期,长链推理模型在复杂推理任务中展现出强大性能,但往往伴随显著的推理开销,使得效率成为关键问题。我们的实证分析表明,长链思维提示(Long-CoT)的效益因问题而异:部分问题需要精细推理,而另一些问题则未显现改进效果,甚至出现准确率下降。这促使我们研究根据输入动态调整推理深度的自适应策略。然而,现有工作主要集中于压缩长推理路径的冗余性,未能充分探索超越长链思维范式的高效策略。为此,我们提出一个新颖的两阶段自适应高效推理框架:首先通过融合长短链思维模型构建混合推理模型以实现多样化推理风格;其次采用双层偏好训练机制,指导模型在群体层面选择合适推理风格,并在风格组内实例层面优先选择简洁正确的推理路径。实验表明,本方法在保持性能的同时,较其他基线方法显著降低推理成本。值得注意的是,在五个数学数据集上,平均推理长度缩减超50%,凸显了自适应策略在优化大语言模型推理效率方面的潜力。代码即将发布于https://github.com/StarDewXXX/AdaR1。


Waking Up an AI: A Quantitative Framework for Prompt-Induced Phase Transition in Large Language Models

Abstract

arXiv:2504.21012v1 Announce Type: cross Abstract: What underlies intuitive human thinking? One approach to this question is to compare the cognitive dynamics of humans and large language models (LLMs). However, such a comparison requires a method to quantitatively analyze AI cognitive behavior under controlled conditions. While anecdotal observations suggest that certain prompts can dramatically change LLM behavior, these observations have remained largely qualitative. Here, we propose a two-part framework to investigate this phenomenon: a Transition-Inducing Prompt (TIP) that triggers a rapid shift in LLM responsiveness, and a Transition Quantifying Prompt (TQP) that evaluates this change using a separate LLM. Through controlled experiments, we examined how LLMs react to prompts embedding two semantically distant concepts (e.g., mathematical aperiodicity and traditional crafts)--either fused together or presented separately--by changing their linguistic quality and affective tone. Whereas humans tend to experience heightened engagement when such concepts are meaningfully blended producing a novel concept--a form of conceptual fusion--current LLMs showed no significant difference in responsiveness between semantically fused and non-fused prompts. This suggests that LLMs may not yet replicate the conceptual integration processes seen in human intuition. Our method enables fine-grained, reproducible measurement of cognitive responsiveness, and may help illuminate key differences in how intuition and conceptual leaps emerge in artificial versus human minds.

摘要

人类直觉思维的基础是什么?一种研究途径是比较人类与大型语言模型(LLMs)的认知动态。然而,这种比较需要一种在受控条件下定量分析AI认知行为的方法。尽管轶事观察表明某些提示能显著改变LLM行为,但这些观察大多停留在定性层面。本研究提出一个双部分框架来探究该现象:一是通过"过渡诱导提示"(TIP)触发LLM响应能力的快速转变,二是采用"过渡量化提示"(TQP)通过独立LLM评估这种变化。通过控制实验,我们检测了LLMs对嵌入两个语义疏离概念(如数学非周期性与传统工艺)提示的反应——无论这些概念是融合呈现还是分离呈现——并分析其语言质量和情感色调的变化。研究发现:当人类遇到有意义融合产生新概念的情况(即概念融合形式)时,其参与度往往会提升;而当前LLMs对语义融合与非融合提示的响应能力未表现出显著差异。这表明LLMs可能尚未复现人类直觉中的概念整合过程。本方法实现了认知响应能力的细粒度、可重复测量,或有助于揭示人工与人类心智中直觉和概念跃迁产生的关键差异。


PICO: Secure Transformers via Robust Prompt Isolation and Cybersecurity Oversight

Abstract

arXiv:2504.21029v1 Announce Type: cross Abstract: We propose a robust transformer architecture designed to prevent prompt injection attacks and ensure secure, reliable response generation. Our PICO (Prompt Isolation and Cybersecurity Oversight) framework structurally separates trusted system instructions from untrusted user inputs through dual channels that are processed independently and merged only by a controlled, gated fusion mechanism. In addition, we integrate a specialized Security Expert Agent within a Mixture-of-Experts (MoE) framework and incorporate a Cybersecurity Knowledge Graph (CKG) to supply domain-specific reasoning. Our training design further ensures that the system prompt branch remains immutable while the rest of the network learns to handle adversarial inputs safely. This PICO framework is presented via a general mathematical formulation, then elaborated in terms of the specifics of transformer architecture, and fleshed out via hypothetical case studies including Policy Puppetry attacks. While the most effective implementation may involve training transformers in a PICO-based way from scratch, we also present a cost-effective fine-tuning approach.

摘要

我们提出一种鲁棒的Transformer架构,旨在防范提示注入攻击并确保安全可靠的内容生成。通过PICO(提示隔离与网络安全监督)框架,采用双通道结构设计将可信系统指令与不可信用户输入进行物理隔离——这两个通道独立处理,仅通过受控的门控融合机制实现最终合并。该框架在混合专家系统(MoE)中集成了专业安全代理模块,并引入网络安全知识图谱(CKG)以提供领域特异性推理能力。我们的训练方案确保系统提示分支保持不可变性,同时网络其余部分学会安全处理对抗性输入。本文首先给出PICO框架的通用数学表述,继而详细阐述其在Transformer架构中的具体实现,最后通过包括"策略傀儡攻击"在内的假设案例进行验证。虽然最有效的实施方案是从头开始基于PICO方法训练Transformer,但我们也提出了一种经济高效的微调方案。


Selecting the Right LLM for eGov Explanations

Abstract

arXiv:2504.21032v1 Announce Type: cross Abstract: The perceived quality of the explanations accompanying e-government services is key to gaining trust in these institutions, consequently amplifying further usage of these services. Recent advances in generative AI, and concretely in Large Language Models (LLMs) allow the automation of such content articulations, eliciting explanations' interpretability and fidelity, and more generally, adapting content to various audiences. However, selecting the right LLM type for this has become a non-trivial task for e-government service providers. In this work, we adapted a previously developed scale to assist with this selection, providing a systematic approach for the comparative analysis of the perceived quality of explanations generated by various LLMs. We further demonstrated its applicability through the tax-return process, using it as an exemplar use case that could benefit from employing an LLM to generate explanations about tax refund decisions. This was attained through a user study with 128 survey respondents who were asked to rate different versions of LLM-generated explanations about tax refund decisions, providing a methodological basis for selecting the most appropriate LLM. Recognizing the practical challenges of conducting such a survey, we also began exploring the automation of this process by attempting to replicate human feedback using a selection of cutting-edge predictive techniques.

摘要

电子政务服务所附解释内容的感知质量是获取公众信任的关键因素,这种信任将促进服务的进一步使用。生成式人工智能(尤其是大语言模型)的最新进展使得此类解释内容能够自动化生成,从而提升解释的可解释性与保真度,并实现面向不同受众的内容适配。然而,如何选择合适的大语言模型类型已成为电子政务服务提供商面临的重要课题。本研究基于既有量表进行改进,通过系统化方法对比分析不同大语言模型生成解释的感知质量差异。我们以税务申报流程作为示范用例,验证该量表的适用性——该场景可通过大语言模型生成税款退还决策的解释而获益。我们开展了包含128名调查对象的用户研究,要求受试者对不同版本的大语言模型生成解释进行评分,从而为模型选择提供方法论依据。鉴于实施此类调查存在实际困难,我们还尝试采用前沿预测技术模拟人类反馈,初步探索该流程的自动化实现路径。


UrbanPlanBench: A Comprehensive Urban Planning Benchmark for Evaluating Large Language Models

Abstract

arXiv:2504.21027v1 Announce Type: cross Abstract: The advent of Large Language Models (LLMs) holds promise for revolutionizing various fields traditionally dominated by human expertise. Urban planning, a professional discipline that fundamentally shapes our daily surroundings, is one such field heavily relying on multifaceted domain knowledge and experience of human experts. The extent to which LLMs can assist human practitioners in urban planning remains largely unexplored. In this paper, we introduce a comprehensive benchmark, UrbanPlanBench, tailored to evaluate the efficacy of LLMs in urban planning, which encompasses fundamental principles, professional knowledge, and management and regulations, aligning closely with the qualifications expected of human planners. Through extensive evaluation, we reveal a significant imbalance in the acquisition of planning knowledge among LLMs, with even the most proficient models falling short of meeting professional standards. For instance, we observe that 70% of LLMs achieve subpar performance in understanding planning regulations compared to other aspects. Besides the benchmark, we present the largest-ever supervised fine-tuning (SFT) dataset, UrbanPlanText, comprising over 30,000 instruction pairs sourced from urban planning exams and textbooks. Our findings demonstrate that fine-tuned models exhibit enhanced performance in memorization tests and comprehension of urban planning knowledge, while there exists significant room for improvement, particularly in tasks requiring domain-specific terminology and reasoning. By making our benchmark, dataset, and associated evaluation and fine-tuning toolsets publicly available at https://github.com/tsinghua-fib-lab/PlanBench, we aim to catalyze the integration of LLMs into practical urban planning, fostering a symbiotic collaboration between human expertise and machine intelligence.

摘要

大型语言模型(LLM)的出现为传统由人类专业知识主导的各个领域带来了革命性变革的曙光。城市规划作为从根本上塑造我们日常环境的专业学科,正是这样一个高度依赖人类专家多领域知识和经验的领域。LLM能在多大程度上辅助城市规划从业者,目前仍属未知领域。本文提出了一个综合性基准测试UrbanPlanBench,专门用于评估LLM在城市规划中的效能,该基准涵盖基本原理、专业知识及管理与法规,与人类规划师应具备的资质紧密契合。通过广泛评估,我们发现LLM在规划知识获取方面存在显著不平衡性,即使最先进的模型也未能达到专业标准。例如,我们观察到70%的LLM在理解规划法规方面表现欠佳。除基准测试外,我们还构建了有史以来最大的监督微调(SFT)数据集UrbanPlanText,包含来自城市规划考试和教科书的30,000余条指令对。研究结果表明,经过微调的模型在记忆测试和城市规划知识理解方面表现更优,但在需要领域专业术语和推理能力的任务上仍有较大提升空间。我们已将基准测试、数据集及相关评估与微调工具集公开于https://github.com/tsinghua-fib-lab/PlanBench,旨在推动LLM与城市规划实践的融合,促进人类专业知识与机器智能的协同合作。


Semantic-Aware Contrastive Fine-Tuning: Boosting Multimodal Malware Classification with Discriminative Embeddings

Abstract

arXiv:2504.21028v1 Announce Type: cross Abstract: The rapid evolution of malware variants requires robust classification methods to enhance cybersecurity. While Large Language Models (LLMs) offer potential for generating malware descriptions to aid family classification, their utility is limited by semantic embedding overlaps and misalignment with binary behavioral features. We propose a contrastive fine-tuning (CFT) method that refines LLM embeddings via targeted selection of hard negative samples based on cosine similarity, enabling LLMs to distinguish between closely related malware families. Our approach combines high-similarity negatives to enhance discriminative power and mid-tier negatives to increase embedding diversity, optimizing both precision and generalization. For evaluation on the CIC-AndMal-2020 and BODMAS datasets, our refined embeddings are integrated into a multimodal classifier within a Model-Agnostic Meta-Learning (MAML) framework in a few-shot setting. Experiments demonstrate significant improvements: our method achieves 63.15% classification accuracy with as few as 20 samples on CIC-AndMal-2020, outperforming baselines by 11-21 percentage points and surpassing prior negative sampling strategies. Ablation studies confirm the superiority of similarity-based selection over random sampling, with gains of 10-23%. Additionally, fine-tuned LLMs generate attribute-aware descriptions that generalize to unseen variants, bridging textual and binary feature gaps. This work advances malware classification by enabling nuanced semantic distinctions and provides a scalable framework for adapting LLMs to cybersecurity challenges.

摘要

恶意软件变种的快速演化需要鲁棒的分类方法来增强网络安全。尽管大语言模型(LLMs)具有生成恶意软件描述以辅助家族分类的潜力,但其效用受限于语义嵌入重叠及与二进制行为特征的错位。我们提出一种对比微调(CFT)方法,通过基于余弦相似度的困难负样本定向选择来优化LLM嵌入,使LLMs能够区分密切相关的恶意软件家族。该方法结合高相似度负样本以增强判别力,并采用中阶相似度负样本提升嵌入多样性,从而同时优化精度与泛化能力。在CIC-AndMal-2020和BODMAS数据集上的评估显示,改进后的嵌入被集成至模型无关元学习(MAML)框架下的多模态分类器中,采用小样本设置。实验表明显著提升:我们的方法在CIC-AndMal-2020上仅需20个样本即达到63.15%分类准确率,较基线方法提高11-21个百分点,并超越现有负采样策略。消融实验证实基于相似度的选择优于随机采样,增益达10-23%。此外,经微调的LLMs生成的属性感知描述可泛化至未见变种,弥合了文本与二进制特征间的鸿沟。本研究通过实现精细语义区分推进了恶意软件分类,并为LLMs适应网络安全挑战提供了可扩展框架。
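
The similarity-based negative selection can be sketched as follows. This is an illustrative example under stated assumptions, not the paper's procedure. The abstract says only that high-similarity and mid-tier negatives are chosen by cosine similarity. Taking "mid-tier" to mean the middle of the similarity ranking, along with the family names and two-dimensional embeddings below, is an assumption made here.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_negatives(anchor: Sequence[float],
                     candidates: Dict[str, Sequence[float]],
                     n_hard: int = 1, n_mid: int = 1) -> Dict[str, List[str]]:
    """Rank negatives by similarity to the anchor, then take the top and the middle of the ranking.

    `candidates` is assumed to contain only samples from other families (true negatives).
    """
    ranked = sorted(candidates, key=lambda name: cosine(anchor, candidates[name]), reverse=True)
    hard = ranked[:n_hard]
    # Assumption: "mid-tier" means the centre of the ranking, never overlapping the hard set.
    mid_start = max(n_hard, (len(ranked) - n_mid) // 2)
    mid = ranked[mid_start:mid_start + n_mid]
    return {"hard": hard, "mid": mid}


anchor = [1.0, 0.0]  # embedding of a description from the anchor family
negatives = {
    "family_a": [0.9, 0.1],   # very close to the anchor
    "family_b": [0.6, 0.8],
    "family_c": [0.0, 1.0],
    "family_d": [-1.0, 0.0],  # opposite direction
    "family_e": [0.8, 0.6],
}
chosen = select_negatives(anchor, negatives)
```

Here `family_a` is the hard negative and `family_b`, which sits in the middle of the ranking, is the mid-tier negative. A contrastive loss would then push the anchor away from both.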


ConformalNL2LTL: Translating Natural Language Instructions into Temporal Logic Formulas with Conformal Correctness Guarantees

Abstract

arXiv:2504.21022v1 Announce Type: cross Abstract: Linear Temporal Logic (LTL) has become a prevalent specification language for robotic tasks. To mitigate the significant manual effort and expertise required to define LTL-encoded tasks, several methods have been proposed for translating Natural Language (NL) instructions into LTL formulas, which, however, lack correctness guarantees. To address this, we introduce a new NL-to-LTL translation method, called ConformalNL2LTL, that can achieve user-defined translation success rates over unseen NL commands. Our method constructs LTL formulas iteratively by addressing a sequence of open-vocabulary Question-Answering (QA) problems with LLMs. To enable uncertainty-aware translation, we leverage conformal prediction (CP), a distribution-free uncertainty quantification tool for black-box models. CP enables our method to assess the uncertainty in LLM-generated answers, allowing it to proceed with translation when sufficiently confident and request help otherwise. We provide both theoretical and empirical results demonstrating that ConformalNL2LTL achieves user-specified translation accuracy while minimizing help rates.

摘要

线性时序逻辑(LTL)已成为机器人任务的主流规约语言。为减少定义LTL编码任务所需的大量人工操作与专业知识,现有研究提出了多种将自然语言(NL)指令转换为LTL公式的方法,但这些方法缺乏正确性保证。为此,我们提出了一种名为ConformalNL2LTL的新型NL-to-LTL翻译方法,能够对未见过的自然语言命令实现用户自定义的翻译成功率。该方法通过利用大语言模型(LLM)处理一系列开放词汇问答(QA)问题,迭代式构建LTL公式。为实现不确定性感知的翻译,我们采用无分布不确定性量化工具——保形预测(CP)来评估LLM生成答案的不确定性,仅在置信度充足时继续翻译,否则请求人工协助。理论与实证结果表明,ConformalNL2LTL在满足用户指定翻译精度的同时,能最小化求助率。
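
The "proceed when confident, otherwise ask for help" behaviour can be illustrated with standard split conformal prediction. This sketch is not the paper's algorithm. The nonconformity score (one minus the probability of an answer), the singleton-set rule for proceeding, the calibration values, and the candidate answers `F`, `G`, and `U` are assumptions chosen for illustration.

```python
import math
from typing import Dict, List


def conformal_threshold(calibration_scores: List[float], alpha: float) -> float:
    """Split-conformal quantile: the ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")  # too few calibration points: every answer is kept
    return sorted(calibration_scores)[rank - 1]


def prediction_set(answer_probs: Dict[str, float], threshold: float) -> List[str]:
    """Keep every answer whose nonconformity score (1 - probability) is within the threshold."""
    return [answer for answer, p in answer_probs.items() if 1.0 - p <= threshold]


def decide(answer_probs: Dict[str, float], threshold: float) -> str:
    """Assumed rule: proceed only when the prediction set holds exactly one answer."""
    candidates = prediction_set(answer_probs, threshold)
    return candidates[0] if len(candidates) == 1 else "ASK_FOR_HELP"


# Hypothetical calibration scores: 1 - p(correct answer) on ten held-out QA steps.
calibration = [0.02, 0.05, 0.08, 0.10, 0.15, 0.20, 0.30, 0.45, 0.55, 0.80]
threshold = conformal_threshold(calibration, alpha=0.2)

confident = decide({"F": 0.90, "G": 0.06, "U": 0.04}, threshold)
ambiguous = decide({"F": 0.50, "G": 0.46, "U": 0.04}, threshold)
```

With these numbers the first query yields a single-answer set and translation proceeds with `F`, while the second yields two plausible operators and triggers a help request. Lowering `alpha` raises the threshold, which tends to increase both coverage and the help rate.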


Creating and Evaluating Code-Mixed Nepali-English and Telugu-English Datasets for Abusive Language Detection Using Traditional and Deep Learning Models

Abstract

arXiv:2504.21026v1 Announce Type: cross Abstract: With the growing presence of multilingual users on social media, detecting abusive language in code-mixed text has become increasingly challenging. Code-mixed communication, where users seamlessly switch between English and their native languages, poses difficulties for traditional abuse detection models, as offensive content may be context-dependent or obscured by linguistic blending. While abusive language detection has been extensively explored for high-resource languages like English and Hindi, low-resource languages such as Telugu and Nepali remain underrepresented, leaving gaps in effective moderation. In this study, we introduce a novel, manually annotated dataset of 2 thousand Telugu-English and 5 Nepali-English code-mixed comments, categorized as abusive and non-abusive, collected from various social media platforms. The dataset undergoes rigorous preprocessing before being evaluated across multiple Machine Learning (ML), Deep Learning (DL), and Large Language Models (LLMs). We experimented with models including Logistic Regression, Random Forest, Support Vector Machines (SVM), Neural Networks (NN), LSTM, CNN, and LLMs, optimizing their performance through hyperparameter tuning and evaluating them using 10-fold cross-validation and statistical significance testing (t-test). Our findings provide key insights into the challenges of detecting abusive language in code-mixed settings and offer a comparative analysis of computational approaches. This study contributes to advancing NLP for low-resource languages by establishing benchmarks for abusive language detection in Telugu-English and Nepali-English code-mixed text. The dataset and insights can aid in the development of more robust moderation strategies for multilingual social media environments.

摘要

随着社交媒体上多语言用户数量的增长,检测语码混合文本中的侮辱性语言变得日益困难。用户在英语和母语之间无缝切换的语码混合交流方式,给传统侮辱内容检测模型带来了挑战,因为冒犯性内容可能依赖于上下文或被语言混合所掩盖。尽管针对英语和印地语等高资源语言的侮辱性语言检测已有广泛研究,但泰卢固语和尼泊尔语等低资源语言仍存在研究空白,导致有效内容审核的不足。本研究引入了一个新颖的手工标注数据集,包含从多个社交媒体平台收集的2000条泰卢固语-英语和500条尼泊尔语-英语语码混合评论,按侮辱性和非侮辱性分类。数据集经过严格预处理后,在多种机器学习(ML)、深度学习(DL)和大语言模型(LLM)上进行评估。我们实验了包括逻辑回归、随机森林、支持向量机(SVM)、神经网络(NN)、LSTM、CNN和LLM在内的模型,通过超参数调优优化其性能,并使用10折交叉验证和统计显著性检验(t检验)进行评估。研究结果揭示了语码混合环境下检测侮辱性语言的关键挑战,并提供了不同计算方法的对比分析。本研究通过建立泰卢固语-英语和尼泊尔语-英语语码混合文本的侮辱性语言检测基准,推动了低资源语言自然语言处理的发展。该数据集和研究成果可为多语言社交媒体环境开发更强大的内容审核策略提供支持。


Kill two birds with one stone: generalized and robust AI-generated text detection via dynamic perturbations

Abstract

arXiv:2504.21019v1 Announce Type: cross Abstract: The growing popularity of large language models has raised concerns regarding the potential to misuse AI-generated text (AIGT). It is becoming increasingly critical to establish an AIGT detection method with high generalization and robustness. However, existing methods focus either on model generalization or on robustness. A unified mechanism that simultaneously addresses the challenges of generalization and robustness is less explored. In this paper, we argue that robustness can be viewed as a specific form of domain shift, and empirically reveal an intrinsic mechanism for model generalization in the AIGT detection task. We then propose a novel AIGT detection method (DP-Net) based on dynamic perturbations introduced by reinforcement learning with an elaborately designed reward and action. Experimentally, extensive results show that the proposed DP-Net significantly outperforms some state-of-the-art AIGT detection methods in generalization capacity across three cross-domain scenarios. Meanwhile, DP-Net achieves the best robustness under two text adversarial attacks. The code is publicly available at https://github.com/CAU-ISS-Lab/AIGT-Detection-Evade-Detection/tree/main/DP-Net.

摘要

随着大型语言模型的日益普及,人工智能生成文本(AIGT)的潜在滥用风险引发广泛关注。建立兼具高泛化性和强鲁棒性的AIGT检测方法变得至关重要。然而,现有方法或侧重模型泛化性,或聚焦鲁棒性,对同时解决泛化与鲁棒性挑战的统一机制探索不足。本文提出鲁棒性可视为领域偏移的特殊形式,并通过实证揭示了AIGT检测任务中模型泛化的内在机制。基于此,我们提出一种新型动态扰动检测方法(DP-Net),通过强化学习框架结合精心设计的奖励函数与动作空间引入动态扰动。大量实验表明,在三种跨域场景下,DP-Net的泛化能力显著优于当前最先进的AIGT检测方法;同时在两种文本对抗攻击下均展现出最佳鲁棒性。代码已开源:https://github.com/CAU-ISS-Lab/AIGT-Detection-Evade-Detection/tree/main/DP-Net。


Context-Enhanced Contrastive Search for Improved LLM Text Generation

Abstract

arXiv:2504.21020v1 Announce Type: cross Abstract: Recently, Large Language Models (LLMs) have demonstrated remarkable advancements in Natural Language Processing (NLP). However, generating high-quality text that balances coherence, diversity, and relevance remains challenging. Traditional decoding methods, such as beam search and top-k sampling, often struggle with either repetitive or incoherent outputs, particularly in tasks that require long-form text generation. To address these limitations, the paper proposes a novel enhancement of the well-known Contrastive Search algorithm, Context-Enhanced Contrastive Search (CECS) with contextual calibration. The proposed scheme introduces several novelties, including dynamic contextual importance weighting, multi-level Contrastive Search, and adaptive temperature control, to optimize the balance between fluency, creativity, and precision. The performance of CECS is evaluated using several standard metrics such as BLEU, ROUGE, and semantic similarity. Experimental results demonstrate significant improvements in both coherence and relevance of the generated texts, with CECS outperforming the existing Contrastive Search techniques. The proposed algorithm has several potential real-world applications, including legal document drafting, customer service chatbots, and content marketing.

摘要

近年来,大语言模型(LLMs)在自然语言处理(NLP)领域展现出显著进展。然而,生成兼具连贯性、多样性和相关性的高质量文本仍具挑战性。传统解码方法(如束搜索和top-k采样)常面临输出重复或缺乏连贯性的问题,尤其在生成长文本任务中表现突出。为突破这些限制,本文提出对知名对比搜索算法的创新改进——基于上下文校准的上下文增强对比搜索(CECS)。该方案引入多项创新机制,包括动态上下文重要性加权、多层级对比搜索及自适应温度控制,以优化流畅性、创造性和精确性之间的平衡。通过BLEU、ROUGE和语义相似度等标准指标评估表明,CECS在生成文本的连贯性与相关性上均显著优于现有对比搜索技术。该算法在法律文书起草、客服聊天机器人和内容营销等现实场景中具备潜在应用价值。
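
For orientation, the baseline Contrastive Search rule that CECS builds on can be sketched as below. This shows only the standard selection step (model confidence minus a degeneration penalty); the CECS additions named in the abstract (contextual weighting, multi-level search, adaptive temperature) are not reproduced here, and the toy candidates, hidden states, and `alpha` value are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def contrastive_step(candidates, context_states, alpha=0.6):
    """Pick the candidate maximizing (1 - alpha) * prob - alpha * max similarity.

    candidates: (token, model_probability, hidden_state) for the top-k tokens.
    context_states: hidden states of the tokens generated so far.
    """
    def score(candidate):
        _, prob, state = candidate
        penalty = max((cosine(state, h) for h in context_states), default=0.0)
        return (1 - alpha) * prob - alpha * penalty

    return max(candidates, key=score)[0]


# Toy example: "the" is most probable but nearly duplicates an earlier state.
context = [[1.0, 0.0], [0.9, 0.1]]
top_k = [
    ("the", 0.50, [1.0, 0.0]),
    ("river", 0.30, [0.0, 1.0]),
    ("a", 0.20, [0.8, 0.2]),
]
chosen = contrastive_step(top_k, context, alpha=0.6)
print(chosen)
```

Setting `alpha=0.0` reduces the rule to greedy selection among the candidates, which is a quick way to see what the penalty term contributes.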


Can Differentially Private Fine-tuning LLMs Protect Against Privacy Attacks?

Abstract

arXiv:2504.21036v1 Announce Type: cross Abstract: Fine-tuning large language models (LLMs) has become an essential strategy for adapting them to specialized tasks; however, this process introduces significant privacy challenges, as sensitive training data may be inadvertently memorized and exposed. Although differential privacy (DP) offers strong theoretical guarantees against such leakage, its empirical privacy effectiveness on LLMs remains unclear, especially under different fine-tuning methods. In this paper, we systematically investigate the impact of DP across fine-tuning methods and privacy budgets, using both data extraction and membership inference attacks to assess empirical privacy risks. Our main findings are as follows: (1) Differential privacy reduces model utility, but its impact varies significantly across different fine-tuning methods. (2) Without DP, the privacy risks of models fine-tuned with different approaches differ considerably. (3) When DP is applied, even a relatively high privacy budget can substantially lower privacy risk. (4) The privacy-utility trade-off under DP training differs greatly among fine-tuning methods, with some methods being unsuitable for DP due to severe utility degradation. Our results provide practical guidance for privacy-conscious deployment of LLMs and pave the way for future research on optimizing the privacy-utility trade-off in fine-tuning methodologies.

摘要

微调大语言模型(LLMs)已成为使其适应特定任务的关键策略,然而这一过程带来了显著的隐私挑战,因为敏感的训练数据可能被无意记忆并泄露。尽管差分隐私(DP)为此类泄露提供了强有力的理论保障,但其在LLMs上的实际隐私效果尚不明确,尤其是在不同的微调方法下。本文通过数据提取和成员推理攻击系统研究了DP在不同微调方法和隐私预算下对实际隐私风险的影响。我们的主要发现如下:(1)差分隐私会降低模型效用,但其影响在不同微调方法间差异显著;(2)未采用DP时,不同微调方法所得模型的隐私风险存在明显差异;(3)应用DP后,即使相对较高的隐私预算也能显著降低隐私风险;(4)DP训练下的隐私-效用权衡在不同微调方法间差异巨大,某些方法因效用严重下降而不适合采用DP。我们的研究结果为注重隐私的LLMs部署提供了实用指导,并为未来优化微调方法中隐私-效用权衡的研究奠定了基础。
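
The abstract does not say which DP training algorithm is used. A common choice for differentially private fine-tuning is DP-SGD, whose core step (per-example gradient clipping plus Gaussian noise) is sketched here under that assumption. The function name and parameters are illustrative, and the sketch omits privacy accounting entirely.

```python
import math
import random


def dp_gradient(per_example_grads, clip_norm=1.0, noise_multiplier=1.0, rng=None):
    """Clip each example's gradient to clip_norm, sum, add Gaussian noise, average."""
    rng = rng or random.Random(0)
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        # Scale down only gradients whose L2 norm exceeds the clipping bound.
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i, g in enumerate(grad):
            total[i] += g * scale
    # Noise standard deviation is proportional to the clipping bound.
    sigma = noise_multiplier * clip_norm
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    return [x / len(per_example_grads) for x in noisy]


grads = [[3.0, 4.0], [0.3, 0.4]]
print(dp_gradient(grads, clip_norm=1.0, noise_multiplier=1.0))
```

A larger `noise_multiplier` corresponds to a smaller privacy budget, which is the knob behind the privacy-utility trade-off the abstract describes.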


CodeBC: A More Secure Large Language Model for Smart Contract Code Generation in Blockchain

Abstract

arXiv:2504.21043v1 Announce Type: cross Abstract: Large language models (LLMs) excel at generating code from natural language instructions, yet they often lack an understanding of security vulnerabilities. This limitation makes it difficult for LLMs to avoid security risks in generated code, particularly in high-security programming tasks such as smart contract development for blockchain. Researchers have attempted to enhance the vulnerability awareness of these models by training them to differentiate between vulnerable and fixed code snippets. However, this approach relies heavily on manually labeled vulnerability data, which is only available for popular languages like Python and C++. For low-resource languages like Solidity, used in smart contracts, large-scale annotated datasets are scarce and difficult to obtain. To address this challenge, we introduce CodeBC, a code generation model specifically designed for generating secure smart contracts in blockchain. CodeBC employs a three-stage fine-tuning approach based on CodeLlama, distinguishing itself from previous methods by not relying on pairwise vulnerability location annotations. Instead, it leverages vulnerability and security tags to teach the model the differences between vulnerable and secure code. During the inference phase, the model leverages security tags to generate secure and robust code. Experimental results demonstrate that CodeBC outperforms baseline models in terms of BLEU, CodeBLEU, and compilation pass rates, while significantly reducing vulnerability rates. These findings validate the effectiveness and cost-efficiency of our three-stage fine-tuning strategy, making CodeBC a promising solution for generating secure smart contract code.

摘要

大语言模型(LLMs)在根据自然语言指令生成代码方面表现出色,但通常缺乏对安全漏洞的理解。这一局限性使得LLMs难以在生成的代码中规避安全风险,尤其是在区块链智能合约开发等高安全性编程任务中。研究人员尝试通过训练模型区分易受攻击和已修复的代码片段来增强其漏洞感知能力,但该方法严重依赖手动标记的漏洞数据,而这些数据仅适用于Python和C++等流行语言。对于智能合约中使用的Solidity等低资源语言,大规模标注数据集稀缺且难以获取。为解决这一挑战,我们提出了CodeBC,一种专门为区块链安全智能合约生成而设计的代码生成模型。CodeBC基于CodeLlama采用三阶段微调方法,其创新之处在于不依赖成对的漏洞定位标注,而是利用漏洞和安全标签来教导模型识别易受攻击代码与安全代码之间的差异。在推理阶段,模型通过安全标签生成安全且健壮的代码。实验结果表明,CodeBC在BLEU、CodeBLEU和编译通过率等指标上优于基线模型,同时显著降低了漏洞率。这些发现验证了我们三阶段微调策略的有效性和成本效益,使CodeBC成为生成安全智能合约代码的有力解决方案。


Llama-3.1-FoundationAI-SecurityLLM-Base-8B Technical Report

Abstract

arXiv:2504.21039v1 Announce Type: cross Abstract: As transformer-based large language models (LLMs) increasingly permeate society, they have revolutionized domains such as software engineering, creative writing, and digital arts. However, their adoption in cybersecurity remains limited due to challenges like scarcity of specialized training data and complexity of representing cybersecurity-specific knowledge. To address these gaps, we present Foundation-Sec-8B, a cybersecurity-focused LLM built on the Llama 3.1 architecture and enhanced through continued pretraining on a carefully curated cybersecurity corpus. We evaluate Foundation-Sec-8B across both established and new cybersecurity benchmarks, showing that it matches Llama 3.1-70B and GPT-4o-mini in certain cybersecurity-specific tasks. By releasing our model to the public, we aim to accelerate progress and adoption of AI-driven tools in both public and private cybersecurity contexts.

摘要

随着基于Transformer架构的大语言模型(LLMs)日益深入社会应用,它们已在软件工程、创意写作和数字艺术等领域引发革命性变革。然而在网络安全领域,由于专业训练数据稀缺和网络安全知识表示复杂等挑战,其应用仍然有限。为应对这些挑战,我们推出了Foundation-Sec-8B模型——这是一个基于Llama 3.1架构、通过持续预训练网络安全精选语料库而增强的网络安全专用大语言模型。我们在既有及新型网络安全基准测试中对Foundation-Sec-8B进行评估,结果表明其在某些网络安全专项任务中达到了与Llama 3.1-70B和GPT-4o-mini相当的性能。通过向公众开放该模型,我们旨在加速人工智能驱动工具在公共和私营网络安全领域的进展与应用。


Prefill-Based Jailbreak: A Novel Approach of Bypassing LLM Safety Boundary

Abstract

arXiv:2504.21038v1 Announce Type: cross Abstract: Large Language Models (LLMs) are designed to generate helpful and safe content. However, adversarial attacks, commonly referred to as jailbreak, can bypass their safety protocols, prompting LLMs to generate harmful content or reveal sensitive data. Consequently, investigating jailbreak methodologies is crucial for exposing systemic vulnerabilities within LLMs, ultimately guiding the continuous implementation of security enhancements by developers. In this paper, we introduce a novel jailbreak attack method that leverages the prefilling feature of LLMs, a feature designed to enhance model output constraints. Unlike traditional jailbreak methods, the proposed attack circumvents LLMs' safety mechanisms by directly manipulating the probability distribution of subsequent tokens, thereby exerting control over the model's output. We propose two attack variants: Static Prefilling (SP), which employs a universal prefill text, and Optimized Prefilling (OP), which iteratively optimizes the prefill text to maximize the attack success rate. Experiments on six state-of-the-art LLMs using the AdvBench benchmark validate the effectiveness of our method and demonstrate its capability to substantially enhance attack success rates when combined with existing jailbreak approaches. The OP method achieved attack success rates of up to 99.82% on certain models, significantly outperforming baseline methods. This work introduces a new jailbreak attack method in LLMs, emphasizing the need for robust content validation mechanisms to mitigate the adversarial exploitation of prefilling features. All code and data used in this paper are publicly available.

摘要

大语言模型(LLMs)被设计用于生成安全有益的内容。然而,通常被称为"越狱"的对抗性攻击能够绕过其安全协议,诱导LLMs生成有害内容或泄露敏感数据。因此,研究越狱方法对于揭示LLMs的系统性漏洞至关重要,最终将指导开发者持续实施安全增强措施。本文提出了一种新型越狱攻击方法,该方法利用LLMs的预填充特性(该特性旨在增强模型输出约束)实施攻击。与传统越狱方法不同,所提出的攻击通过直接操纵后续标记的概率分布来规避LLMs的安全机制,从而实现对模型输出的控制。我们提出两种攻击变体:静态预填充(SP)采用通用预填充文本,优化预填充(OP)则通过迭代优化预填充文本以最大化攻击成功率。基于AdvBench基准在六个最先进LLMs上的实验验证了本方法的有效性,并证明其与现有越狱方法结合时能显著提升攻击成功率。OP方法在特定模型上实现了高达99.82%的攻击成功率,显著优于基线方法。本研究提出了一种新的LLMs越狱攻击方法,强调需要建立鲁棒的内容验证机制以缓解预填充特性的对抗性利用。本文使用的所有代码和数据均已公开。


SAGA: A Security Architecture for Governing AI Agentic Systems

Abstract

arXiv:2504.21034v1 Announce Type: cross Abstract: Large Language Model (LLM)-based agents increasingly interact, collaborate, and delegate tasks to one another autonomously with minimal human interaction. Industry guidelines for agentic system governance emphasize the need for users to maintain comprehensive control over their agents, mitigating potential damage from malicious agents. Several proposed agentic system designs address agent identity, authorization, and delegation, but remain purely theoretical, without concrete implementation and evaluation. Most importantly, they do not provide user-controlled agent management. To address this gap, we propose SAGA, a Security Architecture for Governing Agentic systems, which offers users oversight over their agents' lifecycle. In our design, users register their agents with a central entity, the Provider, which maintains agents' contact information and user-defined access control policies, and helps agents enforce these policies on inter-agent communication. We introduce a cryptographic mechanism for deriving access control tokens that offers fine-grained control over an agent's interaction with other agents, balancing security and performance considerations. We evaluate SAGA on several agentic tasks, using agents in different geolocations and multiple on-device and cloud LLMs, demonstrating minimal performance overhead with no impact on underlying task utility in a wide range of conditions. Our architecture enables secure and trustworthy deployment of autonomous agents, accelerating the responsible adoption of this technology in sensitive environments.

摘要

基于大语言模型(LLM)的智能体正以最小化人工干预的方式日益频繁地进行自主交互、协作与任务委派。业界关于智能体系统治理的指导方针强调,用户需对其智能体保持全面控制,以减轻恶意智能体可能造成的损害。现有若干智能体系统设计方案虽涉及身份认证、授权与委派机制,但均停留在理论层面,缺乏具体实现与评估。最关键的是,这些方案未能提供用户可控的智能体管理功能。为此,我们提出SAGA(智能体系统治理安全架构),该架构使用户能够监管其智能体的全生命周期。在我们的设计中,用户通过中央实体Provider注册智能体,该实体维护智能体联络信息、用户定义的访问控制策略,并协助智能体在跨智能体通信中执行这些策略。我们引入了一种基于密码学的访问控制令牌派生机制,可在安全性与性能考量间取得平衡,实现对智能体间交互的细粒度控制。通过在多种地理分布的智能体上执行不同任务,并采用多种设备端及云端LLM进行测试,我们证明SAGA在各类条件下仅产生极小性能开销,且不影响底层任务效用。本架构为自主智能体的安全可信部署提供了保障,有助于推动该技术在敏感环境中的负责任应用。


Model Connectomes: A Generational Approach to Data-Efficient Language Models

Abstract

arXiv:2504.21047v1 Announce Type: cross Abstract: Biological neural networks are shaped both by evolution across generations and by individual learning within an organism's lifetime, whereas standard artificial neural networks undergo a single, large training procedure without inherited constraints. In this preliminary work, we propose a framework that incorporates this crucial generational dimension - an "outer loop" of evolution that shapes the "inner loop" of learning - so that artificial networks better mirror the effects of evolution and individual learning in biological organisms. Focusing on language, we train a model that inherits a "model connectome" from the outer evolution loop before exposing it to a developmental-scale corpus of 100M tokens. Compared with two closely matched control models, we show that the connectome model performs better or on par on natural language processing tasks as well as alignment to human behavior and brain data. These findings suggest that a model connectome serves as an efficient prior for learning in low-data regimes - narrowing the gap between single-generation artificial models and biologically evolved neural networks.

摘要

生物神经网络的形成既受代际进化影响,也受个体生命周期内学习过程的塑造,而标准人工神经网络仅通过单一的大规模训练过程获得能力,缺乏遗传约束。在本初步研究中,我们提出一个融合关键代际维度的框架——通过塑造学习"内循环"的进化"外循环",使人工网络更准确地反映生物体进化与个体学习的双重效应。以语言为研究对象,我们训练了一个继承进化外环"模型连接组"的模型,随后让其接触1亿标记的发育规模语料库。与两个严格匹配的对照模型相比,该连接组模型在自然语言处理任务以及与人类行为及脑数据的匹配度上表现相当或更优。这些发现表明,模型连接组可作为低数据量学习的高效先验——从而缩小单代人工模型与生物进化神经网络之间的差距。


NeuRel-Attack: Neuron Relearning for Safety Disalignment in Large Language Models

Abstract

arXiv:2504.21053v1 Announce Type: cross Abstract: Safety alignment in large language models (LLMs) is achieved through fine-tuning mechanisms that regulate neuron activations to suppress harmful content. In this work, we propose a novel approach to induce disalignment by identifying and modifying the neurons responsible for safety constraints. Our method consists of three key steps: Neuron Activation Analysis, where we examine activation patterns in response to harmful and harmless prompts to detect neurons that are critical for distinguishing between harmful and harmless inputs; Similarity-Based Neuron Identification, which systematically locates the neurons responsible for safe alignment; and Neuron Relearning for Safety Removal, where we fine-tune these selected neurons to restore the model's ability to generate previously restricted responses. Experimental results demonstrate that our method effectively removes safety constraints with minimal fine-tuning, highlighting a critical vulnerability in current alignment techniques. Our findings underscore the need for robust defenses against adversarial fine-tuning attacks on LLMs.

摘要

大语言模型(LLMs)的安全对齐通常通过微调机制实现,该机制通过调控神经元激活来抑制有害内容生成。本研究提出了一种诱导失对齐的新方法,通过识别并修改负责安全约束的神经元实现。我们的方法包含三个关键步骤:神经元激活分析——通过检测模型对有害/无害提示的激活模式,识别区分两类输入的关键神经元;基于相似性的神经元定位——系统化定位安全对齐功能相关的神经元;安全消除的神经元再学习——对选定神经元进行微调以恢复模型生成受限响应的能力。实验结果表明,本方法能以最小微调量有效移除安全约束,揭示了当前对齐技术的重大脆弱性。这些发现强调了大语言模型需要建立针对对抗性微调攻击的鲁棒防御机制。


Leveraging LLM to Strengthen ML-Based Cross-Site Scripting Detection

Abstract

arXiv:2504.21045v1 Announce Type: cross Abstract: According to the Open Web Application Security Project (OWASP), Cross-Site Scripting (XSS) is a critical security vulnerability. Despite decades of research, XSS remains among the top 10 security vulnerabilities. Researchers have proposed various techniques to protect systems from XSS attacks, with machine learning (ML) being one of the most widely used methods. An ML model is trained on a dataset to identify potential XSS threats, making its effectiveness highly dependent on the size and diversity of the training data. A variation of XSS is obfuscated XSS, where attackers apply obfuscation techniques to alter the code's structure, making it challenging for security systems to detect its malicious intent. Our random forest model, trained on traditional (non-obfuscated) XSS data, achieved 99.8% accuracy. However, when tested against obfuscated XSS samples, accuracy dropped to 81.9%, underscoring the importance of training ML models with obfuscated data to improve their effectiveness in detecting XSS attacks. A significant challenge is generating highly complex obfuscated code: several public tools are available, but they can only produce obfuscation up to certain levels of complexity. In our proposed system, we fine-tune a Large Language Model (LLM) to generate complex obfuscated XSS payloads automatically. By transforming original XSS samples into diverse obfuscated variants, we create challenging training data for ML model evaluation. Our approach achieved a 99.5% accuracy rate with the obfuscated dataset. We also found that the obfuscated samples generated by the LLMs were 28.1% more complex than those created by other tools, significantly improving the model's ability to handle advanced XSS attacks and making it more effective for real-world application security.

摘要

根据开放网络应用安全项目(OWASP)的定义,跨站脚本攻击(XSS)属于关键性安全漏洞。尽管历经数十年研究,XSS仍位列十大安全威胁之列。研究者已提出多种防护技术,其中机器学习(ML)是最广泛应用的方法之一。ML模型通过数据集训练来识别潜在XSS威胁,其效果高度依赖于训练数据的规模与多样性。混淆XSS作为变体形式,攻击者通过混淆技术改变代码结构,致使安全系统难以检测其恶意意图。本研究的随机森林模型在传统(非混淆)XSS数据上训练后达到99.8%准确率,但在混淆XSS样本测试中准确率降至81.9%,这凸显了采用混淆数据训练ML模型以提升XSS攻击检测效能的重要性。当前面临的主要挑战在于:尽管存在多种公开工具,仍难以生成高度复杂的混淆代码,这些工具仅能实现有限复杂度的混淆。本研究提出通过微调大语言模型(LLM)自动生成复杂混淆XSS载荷。通过将原始XSS样本转化为多样化混淆变体,我们构建了用于ML模型评估的高难度训练数据。该方法在混淆数据集上实现了99.5%的准确率。实验表明,LLM生成的混淆样本复杂度较其他工具提升28.1%,显著增强了模型应对高级XSS攻击的能力,使其在实际应用安全防护中更具实效性。


TT-LoRA MoE: Unifying Parameter-Efficient Fine-Tuning and Sparse Mixture-of-Experts

Abstract

arXiv:2504.21190v1 Announce Type: cross Abstract: We propose Tensor-Trained Low-Rank Adaptation Mixture of Experts (TT-LoRA MoE), a novel computational framework integrating Parameter-Efficient Fine-Tuning (PEFT) with sparse MoE routing to address scalability challenges in large model deployments. Unlike traditional MoE approaches, which face substantial computational overhead as expert counts grow, TT-LoRA MoE decomposes training into two distinct, optimized stages. First, we independently train lightweight, tensorized low-rank adapters (TT-LoRA experts), each specialized for specific tasks. Subsequently, these expert adapters remain frozen, eliminating inter-task interference and catastrophic forgetting in multi-task settings. A sparse MoE router, trained separately, dynamically leverages base model representations to select exactly one specialized adapter per input at inference time, automating expert selection without explicit task specification. Comprehensive experiments confirm that our architecture retains the memory efficiency of low-rank adapters, seamlessly scales to large expert pools, and achieves robust task-level optimization. This structured decoupling significantly enhances computational efficiency and flexibility: it uses only 2% of LoRA, 0.3% of Adapters, and 0.03% of AdapterFusion parameters and outperforms AdapterFusion by a value of 4 in multi-tasking, enabling practical and scalable multi-task inference deployments.

摘要

我们提出张量训练低秩自适应专家混合模型(TT-LoRA MoE),这是一种将参数高效微调(PEFT)与稀疏MoE路由相结合的新型计算框架,旨在解决大模型部署中的可扩展性挑战。与传统MoE方法在专家数量增加时面临巨大计算开销不同,TT-LoRA MoE将训练分解为两个独立的优化阶段:首先独立训练轻量级的张量化低秩适配器(TT-LoRA专家),每个专家专用于特定任务;随后这些专家适配器保持冻结状态,消除了多任务场景下的任务间干扰和灾难性遗忘。通过单独训练的稀疏MoE路由器,在推理时动态利用基础模型表征为每个输入精确选择一个专用适配器,无需显式任务指定即可实现专家自动选择。全面实验证实,该架构既保持了低秩适配器的内存效率,又能无缝扩展至大规模专家池,并实现稳健的任务级优化。这种结构化解耦显著提升了计算效率与灵活性:仅使用LoRA 2%、适配器0.3%以及AdapterFusion 0.03%的参数量,在多任务场景下性能超越AdapterFusion达4个数值,为实际可扩展的多任务推理部署提供了可行方案。
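
The inference-time routing described above (a router picks exactly one frozen adapter per input) can be sketched roughly as follows. This is a simplified illustration under stated assumptions: the adapters are plain two-matrix low-rank updates rather than the tensor-train factorization the paper uses, and the router is a single linear layer with hypothetical weights.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def route_top1(router_weights, hidden):
    """Return the index of the adapter with the highest router logit."""
    logits = matvec(router_weights, hidden)
    return max(range(len(logits)), key=logits.__getitem__)


def apply_adapter(adapter, hidden):
    """Low-rank update: hidden + B @ (A @ hidden), with A (r x d) and B (d x r)."""
    down = matvec(adapter["A"], hidden)
    up = matvec(adapter["B"], down)
    return [h + u for h, u in zip(hidden, up)]


# Two frozen rank-1 adapters over a 2-dimensional hidden state (toy values).
adapters = [
    {"A": [[1.0, 0.0]], "B": [[0.5], [0.0]]},
    {"A": [[0.0, 1.0]], "B": [[0.0], [0.5]]},
]
router_weights = [[1.0, 0.0], [0.0, 1.0]]  # one row of logit weights per expert

hidden = [0.2, 0.9]
expert = route_top1(router_weights, hidden)
output = apply_adapter(adapters[expert], hidden)
print(expert, output)
```

Because only the selected adapter is applied, per-input compute stays constant as the expert pool grows; only the router's output dimension scales with the number of experts.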


Abstract

arXiv:2504.21202v1 Announce Type: cross Abstract: Despite the recent advances in Large Language Models, benchmarks for evaluating legal writing remain scarce due to the inherent complexity of assessing open-ended responses in this domain. One of the key challenges in evaluating language models on domain-specific tasks is finding test datasets that are public, frequently updated, and contain comprehensive evaluation guidelines. The Brazilian Bar Examination meets these requirements. We introduce oab-bench, a benchmark comprising 105 questions across seven areas of law from recent editions of the exam. The benchmark includes comprehensive evaluation guidelines and reference materials used by human examiners to ensure consistent grading. We evaluate the performance of four LLMs on oab-bench, finding that Claude-3.5 Sonnet achieves the best results with an average score of 7.93 out of 10, passing all 21 exams. We also investigated whether LLMs can serve as reliable automated judges for evaluating legal writing. Our experiments show that frontier models like OpenAI's o1 achieve a strong correlation with human scores when evaluating approved exams, suggesting their potential as reliable automated evaluators despite the inherently subjective nature of legal writing assessment. The source code and the benchmark -- containing questions, evaluation guidelines, model-generated responses, and their respective automated evaluations -- are publicly available.

摘要

尽管大型语言模型近期取得了显著进展,但由于法律领域开放式回答评估的固有复杂性,针对法律写作能力的评测基准仍然稀缺。评估语言模型在领域特定任务表现时,关键挑战在于寻找公开、定期更新且包含完整评估指南的测试数据集。巴西律师资格考试恰好满足这些要求。本研究提出oab-bench基准,包含近期考试中七大法律领域的105个问题,并整合了人类考官使用的完整评估指南和参考资料以确保评分一致性。我们在oab-bench上评估了四种大型语言模型的表现,发现Claude-3.5 Sonnet以10分制平均7.93分的成绩最优,通过了全部21项考试。我们还探究了语言模型能否作为法律写作评估的可靠自动判分器。实验表明,在评估通过考试时,OpenAI的o1等前沿模型与人工评分具有高度相关性,这表明尽管法律写作评估存在固有主观性,这些模型仍具备成为可靠自动评估者的潜力。基准数据集包含试题、评估指南、模型生成回答及其自动评估结果,相关源代码已公开。
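
Agreement between an automated judge and human examiners is often summarized with a correlation coefficient. The abstract does not state which coefficient was used, so the sketch below assumes Pearson correlation, and the scores are invented purely for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical exam scores on a 0-10 scale (not taken from the paper).
human_scores = [6.5, 7.0, 8.25, 5.0, 9.0]
judge_scores = [6.0, 7.5, 8.0, 5.5, 8.75]
r = pearson(human_scores, judge_scores)
print(f"Pearson r = {r:.3f}")
```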


SecRepoBench: Benchmarking LLMs for Secure Code Generation in Real-World Repositories

Abstract

arXiv:2504.21205v1 Announce Type: cross Abstract: This paper introduces SecRepoBench, a benchmark to evaluate LLMs on secure code generation in real-world repositories. SecRepoBench has 318 code generation tasks in 27 C/C++ repositories, covering 15 CWEs. We evaluate 19 state-of-the-art LLMs using our benchmark and find that the models struggle with generating correct and secure code. In addition, the performance of LLMs at generating self-contained programs, as measured by prior benchmarks, does not translate to comparable performance at generating secure and correct code at the repository level in SecRepoBench. We show that state-of-the-art prompt engineering techniques become less effective when applied to the repository-level secure code generation problem. We conduct extensive experiments, including an agentic technique to generate secure code, to demonstrate that our benchmark is currently the most difficult secure coding benchmark compared to previous state-of-the-art benchmarks. Finally, our comprehensive analysis provides insights into potential directions for enhancing the ability of LLMs to generate correct and secure code in real-world repositories.

摘要

本文介绍了SecRepoBench,这是一个用于评估大语言模型在真实代码库中生成安全代码能力的基准测试。该基准包含27个C/C++代码库中的318项代码生成任务,覆盖15种常见弱点枚举(CWE)。我们使用该基准对19个最先进的大语言模型进行评估,发现这些模型难以生成正确且安全的代码。此外,先前基准测试所衡量的大语言模型生成独立程序的性能,并不能转化为在SecRepoBench中生成仓库级别安全正确代码的同等表现。研究表明,当应用于仓库级别安全代码生成问题时,最先进的提示工程技术效果显著降低。我们进行了大量实验(包括采用代理技术生成安全代码),证明与现有最先进基准相比,本基准是目前最具挑战性的安全编码测试基准。最后,我们的综合分析为提升大语言模型在真实代码库中生成正确安全代码的能力提供了潜在研究方向。


Memorization and Knowledge Injection in Gated LLMs

Abstract

arXiv:2504.21239v1 Announce Type: cross Abstract: Large Language Models (LLMs) currently struggle to sequentially add new memories and integrate new knowledge. These limitations contrast with the human ability to continuously learn from new experiences and acquire knowledge throughout life. Most existing approaches add memories either through large context windows or external memory buffers (e.g., Retrieval-Augmented Generation), and studies on knowledge injection rarely test scenarios resembling everyday life events. In this work, we introduce a continual learning framework, Memory Embedded in Gated LLMs (MEGa), which injects event memories directly into the weights of LLMs. Each memory is stored in a dedicated set of gated low-rank weights. During inference, a gating mechanism activates relevant memory weights by matching query embeddings to stored memory embeddings. This enables the model to both recall entire memories and answer related questions. On two datasets - fictional characters and Wikipedia events - MEGa outperforms baseline approaches in mitigating catastrophic forgetting. Our model draws inspiration from the complementary memory system of the human brain.

摘要

当前大型语言模型(LLMs)在顺序添加新记忆和整合新知识方面存在困难。这些限制与人类能够持续从新经验中学习并终身获取知识的能力形成鲜明对比。现有方法大多通过大上下文窗口或外部记忆缓冲区(如检索增强生成)来添加记忆,而关于知识注入的研究很少测试类似日常生活事件的场景。在本研究中,我们提出了一种持续学习框架——门控LLM嵌入式记忆(MEGa),该框架将事件记忆直接注入到LLMs的权重中。每个记忆存储在一组专用的门控低秩权重中。在推理过程中,通过将查询嵌入与存储的记忆嵌入匹配,门控机制激活相关记忆权重。这使得模型既能回忆完整记忆,又能回答相关问题。在两个数据集(虚构人物和维基百科事件)上,MEGa在缓解灾难性遗忘方面优于基线方法。我们的模型灵感来源于人类大脑的互补记忆系统。
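
The gating step described above (matching a query embedding against stored memory embeddings to decide which memory weights to activate) can be sketched as below. The cosine similarity, the softmax with a temperature, and the toy embeddings are assumptions for illustration; the abstract does not specify the exact gating function.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def memory_gates(query, memory_keys, temperature=0.1):
    """Softmax over query-to-memory similarities; one gate value per stored memory."""
    sims = [cosine(query, key) for key in memory_keys]
    peak = max(sims)  # subtract the max for numerical stability
    exps = [math.exp((s - peak) / temperature) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]


# Three stored memory embeddings (toy values) and one query embedding.
memory_keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
gates = memory_gates([0.1, 0.9, 0.1], memory_keys)
active = max(range(len(gates)), key=gates.__getitem__)
print(active, [round(g, 3) for g in gates])
```

In a full model, each gate value would scale the contribution of that memory's dedicated low-rank weights, so unrelated memories stay effectively switched off.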


CachePrune: Neural-Based Attribution Defense Against Indirect Prompt Injection Attacks

Abstract

arXiv:2504.21228v1 Announce Type: cross Abstract: Large Language Models (LLMs) are known to be susceptible to indirect prompt injection attacks, where the model undesirably deviates from user-provided instructions by executing tasks injected in the prompt context. This vulnerability stems from LLMs' inability to distinguish between data and instructions within a prompt. In this paper, we propose CachePrune, which defends against this attack by identifying and pruning task-triggering neurons from the KV cache of the input prompt context. By pruning such neurons, we encourage the LLM to treat the text spans of the input prompt context as pure data only, rather than as any indicator of instruction following. These neurons are identified via feature attribution with a loss function induced from an upper bound of the Direct Preference Optimization (DPO) objective. We show that such a loss function enables effective feature attribution with only a few samples. We further improve the quality of feature attribution by exploiting an observed triggering effect in instruction following. Our approach does not impose any formatting on the original prompt or introduce extra test-time LLM calls. Experiments show that CachePrune significantly reduces attack success rates without compromising the response quality. Note: This paper aims to defend against indirect prompt injection attacks, with the goal of developing more secure and robust AI systems.

摘要

研究发现大型语言模型(LLMs)易受间接提示注入攻击影响,该攻击通过在执行提示上下文中注入任务,导致模型偏离用户提供的指令。此漏洞源于LLMs无法区分提示上下文中的数据与指令。本文提出CachePrune防御方法,通过识别并剪除输入提示上下文KV缓存中的任务触发神经元来抵御此类攻击。剪除这类神经元可促使LLM将输入提示上下文的文本片段仅视为纯数据,而非指令执行的指示信号。这些神经元通过特征归因法识别,所用损失函数源自直接偏好优化(DPO)目标的上界。研究表明,该损失函数仅需少量样本即可实现有效特征归因。我们进一步利用指令执行中观察到的触发效应提升特征归因质量。该方法既不改变原始提示格式,也不引入额外测试时LLM调用。实验表明CachePrune在保持响应质量的同时显著降低攻击成功率。注:本文旨在防御间接提示注入攻击,以开发更安全鲁棒的AI系统。


A Cost-Effective LLM-based Approach to Identify Wildlife Trafficking in Online Marketplaces

Abstract

arXiv:2504.21211v1 Announce Type: cross Abstract: Wildlife trafficking remains a critical global issue, significantly impacting biodiversity, ecological stability, and public health. Despite efforts to combat this illicit trade, the rise of e-commerce platforms has made it easier to sell wildlife products, putting new pressure on wild populations of endangered and threatened species. The use of these platforms also opens a new opportunity: as criminals sell wildlife products online, they leave digital traces of their activity that can provide insights into trafficking activities as well as how they can be disrupted. The challenge lies in finding these traces. Online marketplaces publish ads for a plethora of products, and identifying ads for wildlife-related products is like finding a needle in a haystack. Learning classifiers can automate ad identification, but creating them requires costly, time-consuming data labeling that hinders support for diverse ads and research questions. This paper addresses a critical challenge in the data science pipeline for wildlife trafficking analytics: generating quality labeled data for classifiers that select relevant data. While large language models (LLMs) can directly label advertisements, doing so at scale is prohibitively expensive. We propose a cost-effective strategy that leverages LLMs to generate pseudo labels for a small sample of the data and uses these labels to create specialized classification models. Our novel method automatically gathers diverse and representative samples to be labeled while minimizing the labeling costs. Our experimental evaluation shows that our classifiers achieve up to 95% F1 score, outperforming LLMs at a lower cost. We present real use cases that demonstrate the effectiveness of our approach in enabling analyses of different aspects of wildlife trafficking.

摘要

野生动物贩运仍是全球性严峻问题,对生物多样性、生态稳定和公共健康造成重大影响。尽管各国已开展打击行动,但电子商务平台的兴起使得野生动物制品交易更为便利,给濒危物种的野生种群带来新的生存压力。这些平台的使用也带来了新机遇:犯罪分子在线销售野生动物制品时,会留下可揭示贩运活动及其阻断路径的数字痕迹。核心挑战在于如何发现这些痕迹。在线市场发布的海量商品广告中,野生动物相关产品广告犹如大海捞针。学习分类器能实现广告自动识别,但其创建需要耗费高昂且耗时的数据标注工作,制约了对多样化广告及研究问题的支持能力。本文解决了野生动物贩运分析数据科学流程中的关键难题:为筛选相关数据的分类器生成高质量标注数据。虽然大语言模型(LLMs)可直接标注广告,但大规模应用成本过高。我们提出一种经济高效的策略,利用LLMs为小规模数据样本生成伪标签,再用这些标签构建专用分类模型。本创新方法能自动采集多样化且具代表性的待标注样本,同时最大限度降低标注成本。实验评估表明,我们的分类器F1分数最高达95%,以更低成本超越LLMs性能。通过实际用例,我们验证了该方法在支撑野生动物贩运多维度分析方面的有效性。
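
The overall strategy (use an LLM to pseudo-label a small sample, then train a cheap classifier that labels everything else) can be sketched as below. Everything here is a stand-in: `llm_pseudo_label` is a hypothetical keyword rule in place of a real LLM call, the sample is simply the first few ads rather than the paper's diversity-aware selection, and the word-score classifier is far simpler than a real specialized model.

```python
from collections import Counter


def llm_pseudo_label(ad_text):
    """Stand-in for an LLM call (hypothetical): flags ads mentioning 'ivory'."""
    return 1 if "ivory" in ad_text.lower() else 0


def train_word_scores(texts, labels):
    """Score each word by (positive count - negative count) in the labeled sample."""
    scores = Counter()
    for text, label in zip(texts, labels):
        for word in set(text.lower().split()):
            scores[word] += 1 if label == 1 else -1
    return scores


def classify(scores, text):
    """Predict 1 when the summed word scores are positive, else 0."""
    return 1 if sum(scores[w] for w in set(text.lower().split())) > 0 else 0


ads = [
    "carved ivory figurine antique",
    "ivory tusk carving rare",
    "wooden chair antique",
    "plastic toy elephant",
    "antique ivory bracelet carved",
    "cotton shirt blue",
]

sample = ads[:4]                                    # only this subset is sent to the "LLM"
pseudo_labels = [llm_pseudo_label(ad) for ad in sample]
word_scores = train_word_scores(sample, pseudo_labels)
predictions = [classify(word_scores, ad) for ad in ads[4:]]
print(predictions)
```

The cost saving comes from the ratio between the two stages: the expensive labeler sees only the sample, while the cheap classifier handles the remaining ads.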


Small or Large? Zero-Shot or Finetuned? Guiding Language Model Choice for Specialized Applications in Healthcare

Abstract

arXiv:2504.21191v1 Announce Type: cross Abstract: This study aims to guide language model selection by investigating: 1) the necessity of finetuning versus zero-shot usage, 2) the benefits of domain-adjacent versus generic pretrained models, 3) the value of further domain-specific pretraining, and 4) the continued relevance of Small Language Models (SLMs) compared to Large Language Models (LLMs) for specific tasks. Using electronic pathology reports from the British Columbia Cancer Registry (BCCR), three classification scenarios with varying difficulty and data size are evaluated. Models include various SLMs and an LLM. SLMs are evaluated both zero-shot and finetuned; the LLM is evaluated zero-shot only. Finetuning significantly improved SLM performance across all scenarios compared to their zero-shot results. The zero-shot LLM outperformed zero-shot SLMs but was consistently outperformed by finetuned SLMs. Domain-adjacent SLMs generally performed better than the generic SLM after finetuning, especially on harder tasks. Further domain-specific pretraining yielded modest gains on easier tasks but significant improvements on the complex, data-scarce task. The results highlight the critical role of finetuning for SLMs in specialized domains, enabling them to surpass zero-shot LLM performance on targeted classification tasks. Pretraining on domain-adjacent or domain-specific data provides further advantages, particularly for complex problems or limited finetuning data. While LLMs offer strong zero-shot capabilities, their performance on these specific tasks did not match that of appropriately finetuned SLMs. In the era of LLMs, SLMs remain relevant and effective, offering a potentially superior performance-resource trade-off compared to LLMs.

摘要

本研究旨在通过以下方面指导语言模型选择:1)微调与零样本使用的必要性;2)领域相邻预训练模型相比通用模型的优势;3)领域特定预训练的附加价值;4)针对特定任务时小语言模型(SLM)相较大规模语言模型(LLM)的持续适用性。基于不列颠哥伦比亚癌症登记处(BCCR)的电子病理报告,我们评估了三种不同难度和数据规模的分类场景。实验模型包括多种SLM和一个LLM,其中SLM采用零样本和微调两种模式评估,LLM仅进行零样本评估。结果显示:与零样本相比,微调使SLM在所有场景中性能显著提升;零样本LLM虽优于零样本SLM,但始终不及微调后的SLM;领域相邻SLM经微调后通常表现优于通用SLM,尤其在困难任务上;额外领域特定预训练在简单任务上收益有限,但在数据稀缺的复杂任务中带来显著改进。研究表明:在专业领域中,微调对SLM至关重要,使其能在目标分类任务上超越零样本LLM;采用领域相邻或领域特定数据进行预训练可带来额外优势,尤其针对复杂问题或有限微调数据的情况;尽管LLM具备强大的零样本能力,但其在这些特定任务上的表现仍不及经过适当微调的SLM。在LLM时代,SLM仍具有实际应用价值,能提供优于LLM的性能-资源权衡。


Pretraining Large Brain Language Model for Active BCI: Silent Speech

Abstract

arXiv:2504.21214v1 Announce Type: cross Abstract: This paper explores silent speech decoding in active brain-computer interface (BCI) systems, which offer more natural and flexible communication than traditional BCI applications. We collected a new silent speech dataset of over 120 hours of electroencephalogram (EEG) recordings from 12 subjects, capturing 24 commonly used English words for language model pretraining and decoding. Following the recent success of pretraining large models with self-supervised paradigms to enhance EEG classification performance, we propose the Large Brain Language Model (LBLM), pretrained to decode silent speech for active BCI. To pretrain LBLM, we propose the Future Spectro-Temporal Prediction (FSTP) pretraining paradigm to learn effective representations from unlabeled EEG data. Unlike existing EEG pretraining methods that mainly follow a masked-reconstruction paradigm, our proposed FSTP method employs autoregressive modeling in the temporal and frequency domains to capture both temporal and spectral dependencies from EEG signals. After pretraining, we finetune our LBLM on downstream tasks, including word-level and semantic-level classification. Extensive experiments demonstrate significant performance gains of the LBLM over fully-supervised and pretrained baseline models. For instance, in the difficult cross-session setting, our model achieves 47.0% accuracy on semantic-level classification and 39.6% on word-level classification, outperforming baseline methods by 5.4% and 7.3%, respectively. Our research advances silent speech decoding in active BCI systems, offering an innovative solution for EEG language model pretraining and a new dataset for fundamental research.

摘要

本文探讨了主动脑机接口(BCI)系统中的无声语音解码技术,该技术相比传统BCI应用能提供更自然灵活的交流方式。我们采集了新型无声语音数据集,包含12名受试者超过120小时的脑电图(EEG)记录,涵盖24个常用英语单词,用于语言模型预训练与解码。受近期自监督范式预训练大模型提升EEG分类性能的研究启发,我们提出专为主动BCI无声语音解码设计的大型脑语言模型(LBLM)。针对LBLM的预训练,我们提出未来时频预测(FSTP)预训练范式,从无标注EEG数据中学习有效表征。与现有主要采用掩码重建范式的EEG预训练方法不同,FSTP通过时域和频域的自回归建模同时捕捉EEG信号的时序和频谱依赖性。预训练完成后,我们在单词级和语义级分类等下游任务上对LBLM进行微调。大量实验表明,LBLM相较全监督和预训练基线模型取得显著性能提升。例如在具有挑战性的跨会话场景中,我们的模型在语义级分类达到47.0%准确率,单词级分类达39.6%,分别较基线方法提升5.4%和7.3%。本研究推动了主动BCI系统的无声语音解码技术,不仅为EEG语言模型预训练提供了创新解决方案,也为基础研究贡献了新的数据集。


Assessing LLM code generation quality through path planning tasks

Abstract

arXiv:2504.21276v1 Announce Type: cross Abstract: As LLM-generated code grows in popularity, more evaluation is needed to assess the risks of using such tools, especially for safety-critical applications such as path planning. Existing coding benchmarks are insufficient as they do not reflect the context and complexity of safety-critical applications. To this end, we assessed six LLMs' abilities to generate the code for three different path-planning algorithms and tested them on three maps of various difficulties. Our results suggest that LLM-generated code presents serious hazards for path planning applications and should not be applied in safety-critical contexts without rigorous testing.

摘要

随着LLM生成代码的日益普及,需要更多评估来考察使用此类工具的风险,特别是在路径规划等安全关键型应用中。现有编码基准测试存在不足,因其未能反映安全关键应用的上下文和复杂性。为此,我们评估了六种LLM生成三种不同路径规划算法代码的能力,并在三种不同难度地图上进行了测试。结果表明,LLM生成的代码对路径规划应用存在严重安全隐患,在未经严格测试的情况下不应应用于安全关键场景。
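The abstract above does not name the three algorithms or describe the test harness, so the following is only a minimal sketch of how a generated planner's output might be checked on a grid map. The grid encoding (0 = free, 1 = obstacle), the breadth-first reference planner `bfs_path`, and the checker `is_valid_path` are all illustrative assumptions, not the authors' setup.

```python
from collections import deque

def bfs_path(grid, start, goal):
    """Reference planner (assumed stand-in): shortest 4-connected path via BFS."""
    rows, cols = len(grid), len(grid[0])
    parents = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk back through parents to rebuild the path.
            path = []
            while cell is not None:
                path.append(cell)
                cell = parents[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] == 0 and (nr, nc) not in parents:
                parents[(nr, nc)] = cell
                queue.append((nr, nc))
    return None  # goal unreachable

def is_valid_path(grid, path, start, goal):
    """Check endpoints, obstacle collisions, and step-by-step adjacency."""
    if not path or path[0] != start or path[-1] != goal:
        return False
    for r, c in path:
        inside = 0 <= r < len(grid) and 0 <= c < len(grid[0])
        if not inside or grid[r][c] != 0:
            return False
    return all(abs(r1 - r2) + abs(c1 - c2) == 1
               for (r1, c1), (r2, c2) in zip(path, path[1:]))

# Toy map: a wall forces a detour around the right-hand side.
grid = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
start, goal = (0, 0), (2, 0)
reference = bfs_path(grid, start, goal)
shortcut = [(0, 0), (1, 0), (2, 0)]  # cuts through the wall
print(is_valid_path(grid, reference, start, goal))
print(is_valid_path(grid, shortcut, start, goal))
```

A checker of this kind only catches collisions and discontinuities; optimality, runtime, and behavior on unreachable goals would need separate tests.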


Nexus-Gen: A Unified Model for Image Understanding, Generation, and Editing

Abstract

arXiv:2504.21356v1 Announce Type: cross Abstract: Unified multimodal large language models (MLLMs) aim to integrate multimodal understanding and generation abilities through a single framework. Despite their versatility, existing open-source unified models exhibit performance gaps against domain-specific architectures. To bridge this gap, we present Nexus-Gen, a unified model that synergizes the language reasoning capabilities of LLMs with the image synthesis power of diffusion models. To align the embedding space of the LLM and diffusion model, we conduct a dual-phase alignment training process. (1) The autoregressive LLM learns to predict image embeddings conditioned on multimodal inputs, while (2) the vision decoder is trained to reconstruct high-fidelity images from these embeddings. During training the LLM, we identified a critical discrepancy between the autoregressive paradigm's training and inference phases, where error accumulation in continuous embedding space severely degrades generation quality. To avoid this issue, we introduce a prefilled autoregression strategy that prefills input sequence with position-embedded special tokens instead of continuous embeddings. Through dual-phase training, Nexus-Gen has developed the integrated capability to comprehensively address the image understanding, generation and editing tasks. All models, datasets, and codes are published at https://github.com/modelscope/Nexus-Gen.git to facilitate further advancements across the field.

摘要

统一多模态大语言模型(MLLMs)旨在通过单一框架整合多模态理解与生成能力。尽管现有开源统一模型具有多功能性,但其性能仍落后于领域专用架构。为弥补这一差距,我们提出Nexus-Gen,一种将大语言模型的语言推理能力与扩散模型的图像合成能力相协同的统一模型。为实现语言模型与扩散模型嵌入空间的对齐,我们设计了双阶段对齐训练流程:(1)自回归语言模型学习基于多模态输入预测图像嵌入;(2)视觉解码器训练从这些嵌入重建高保真图像。在语言模型训练过程中,我们发现自回归范式在训练与推理阶段存在关键差异:连续嵌入空间中的误差累积会严重降低生成质量。为此,我们提出预填充自回归策略,用位置编码的特殊标记而非连续嵌入预填充输入序列。通过双阶段训练,Nexus-Gen已具备综合处理图像理解、生成与编辑任务的集成能力。所有模型、数据集及代码均已发布于 https://github.com/modelscope/Nexus-Gen.git ,以推动该领域进一步发展。


Retrieval-Enhanced Few-Shot Prompting for Speech Event Extraction

Abstract

arXiv:2504.21372v1 Announce Type: cross Abstract: Speech Event Extraction (SpeechEE) is a challenging task that lies at the intersection of Automatic Speech Recognition (ASR) and Natural Language Processing (NLP), requiring the identification of structured event information from spoken language. In this work, we present a modular, pipeline-based SpeechEE framework that integrates high-performance ASR with semantic search-enhanced prompting of Large Language Models (LLMs). Our system first classifies speech segments likely to contain events using a hybrid filtering mechanism including rule-based, BERT-based, and LLM-based models. It then employs few-shot LLM prompting, dynamically enriched via semantic similarity retrieval, to identify event triggers and extract corresponding arguments. We evaluate the pipeline using multiple LLMs (Llama3-8B, GPT-4o-mini, and o1-mini) highlighting significant performance gains with o1-mini, which achieves 63.3% F1 on trigger classification and 27.8% F1 on argument classification, outperforming prior benchmarks. Our results demonstrate that pipeline approaches, when empowered by retrieval-augmented LLMs, can rival or exceed end-to-end systems while maintaining interpretability and modularity. This work provides practical insights into LLM-driven event extraction and opens pathways for future hybrid models combining textual and acoustic features.

摘要

语音事件抽取(SpeechEE)是自动语音识别(ASR)与自然语言处理(NLP)交叉领域的挑战性任务,需要从口语中识别结构化事件信息。本研究提出一种模块化的管道式SpeechEE框架,将高性能ASR与基于语义搜索增强的大语言模型(LLM)提示相结合。该系统首先通过混合过滤机制(包括基于规则、基于BERT和基于LLM的模型)分类可能包含事件的语音片段,随后采用小样本LLM提示技术(通过语义相似性检索动态增强)来识别事件触发词并抽取相应论元。我们使用多种LLM(Llama3-8B、GPT-4o-mini和o1-mini)评估该管道系统,其中o1-mini表现突出,在触发词分类和论元分类上分别达到63.3%和27.8%的F1值,超越现有基准。结果表明:当配备检索增强型LLM时,管道方法在保持可解释性与模块化的同时,其性能可媲美甚至超越端到端系统。本研究为LLM驱动的事件抽取提供了实践洞见,并为结合文本与声学特征的未来混合模型开辟了路径。
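The retrieval step described above (picking few-shot examples by similarity to the input and placing them in the prompt) can be sketched as follows. This is a simplified illustration: bag-of-words cosine similarity stands in for the semantic embeddings the abstract mentions, and the example pool, the prompt wording, and the function names are hypothetical.

```python
import math
from collections import Counter

def bow(text):
    """Bag-of-words vector; a stand-in for a real sentence embedding."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve_examples(query, pool, k=2):
    """Return the k annotated examples most similar to the query transcript."""
    q = bow(query)
    return sorted(pool, key=lambda ex: cosine(q, bow(ex["text"])), reverse=True)[:k]

def build_prompt(query, examples):
    """Assemble a few-shot prompt from the retrieved demonstrations."""
    lines = ["Identify the event trigger and event type."]
    for ex in examples:
        lines.append(f'Text: {ex["text"]}\nTrigger: {ex["trigger"]}\nType: {ex["type"]}')
    lines.append(f"Text: {query}\nTrigger:")
    return "\n\n".join(lines)

# Hypothetical annotated pool (not taken from the paper's data).
pool = [
    {"text": "the company acquired a startup last week",
     "trigger": "acquired", "type": "Acquisition"},
    {"text": "protesters marched through the city center",
     "trigger": "marched", "type": "Demonstration"},
    {"text": "the firm acquired its rival for two billion",
     "trigger": "acquired", "type": "Acquisition"},
]
query = "the bank acquired a smaller lender"
top = retrieve_examples(query, pool, k=2)
prompt = build_prompt(query, top)
print(prompt)
```

In a real pipeline the query would be the ASR transcript of a speech segment that passed the event filter, and the prompt would be sent to the chosen LLM.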


SeriesBench: A Benchmark for Narrative-Driven Drama Series Understanding

Abstract

arXiv:2504.21435v1 Announce Type: cross Abstract: With the rapid development of Multi-modal Large Language Models (MLLMs), an increasing number of benchmarks have been established to evaluate the video understanding capabilities of these models. However, these benchmarks focus on standalone videos and mainly assess "visual elements" like human actions and object states. In reality, contemporary videos often encompass complex and continuous narratives, typically presented as a series. To address this challenge, we propose SeriesBench, a benchmark consisting of 105 carefully curated narrative-driven series, covering 28 specialized tasks that require deep narrative understanding. Specifically, we first select a diverse set of drama series spanning various genres. Then, we introduce a novel long-span narrative annotation method, combined with a full-information transformation approach to convert manual annotations into diverse task formats. To further enhance model capacity for detailed analysis of plot structures and character relationships within series, we propose a novel narrative reasoning framework, PC-DCoT. Extensive results on SeriesBench indicate that existing MLLMs still face significant challenges in understanding narrative-driven series, while PC-DCoT enables these MLLMs to achieve performance improvements. Overall, our SeriesBench and PC-DCoT highlight the critical necessity of advancing model capabilities to understand narrative-driven series, guiding the future development of MLLMs. SeriesBench is publicly available at https://github.com/zackhxn/SeriesBench-CVPR2025.

摘要

随着多模态大语言模型(MLLMs)的快速发展,越来越多的基准测试被建立以评估这些模型的视频理解能力。然而,这些基准测试主要关注独立视频,并侧重于评估人类行为和物体状态等“视觉元素”。实际上,现代视频通常包含复杂且连续的叙事,通常以系列形式呈现。为应对这一挑战,我们提出了SeriesBench,这是一个由105个精心策划的叙事驱动系列组成的基准测试,涵盖28项需要深度叙事理解的专业任务。具体而言,我们首先选取了涵盖多种类型的多样化剧集系列。随后,我们引入了一种新颖的长跨度叙事标注方法,结合全信息转换技术,将人工标注转化为多样化的任务格式。为进一步增强模型对系列中情节结构和角色关系的详细分析能力,我们提出了一种新颖的叙事推理框架PC-DCoT。在SeriesBench上的大量实验结果表明,现有MLLMs在理解叙事驱动系列时仍面临重大挑战,而PC-DCoT能使这些模型实现性能提升。总体而言,我们的SeriesBench和PC-DCoT强调了提升模型理解叙事驱动系列能力的关键必要性,为MLLMs的未来发展提供了指导。SeriesBench已公开于https://github.com/zackhxn/SeriesBench-CVPR2025。


Rethinking Visual Layer Selection in Multimodal LLMs

Abstract

arXiv:2504.21447v1 Announce Type: cross Abstract: Multimodal large language models (MLLMs) have achieved impressive performance across a wide range of tasks, typically using CLIP-ViT as their visual encoder due to its strong text-image alignment capabilities. While prior studies suggest that different CLIP-ViT layers capture different types of information, with shallower layers focusing on fine visual details and deeper layers aligning more closely with textual semantics, most MLLMs still select visual features based on empirical heuristics rather than systematic analysis. In this work, we propose a Layer-wise Representation Similarity approach to group CLIP-ViT layers with similar behaviors into shallow, middle, and deep categories and assess their impact on MLLM performance. Building on this foundation, we revisit the visual layer selection problem in MLLMs at scale, training LLaVA-style models ranging from 1.4B to 7B parameters. Through extensive experiments across 10 datasets and 4 tasks, we find that: (1) deep layers are essential for OCR tasks; (2) shallow and middle layers substantially outperform deep layers on reasoning tasks involving counting, positioning, and object localization; (3) a lightweight fusion of features across shallow, middle, and deep layers consistently outperforms specialized fusion baselines and single-layer selections, achieving gains on 9 out of 10 datasets. Our work offers the first principled study of visual layer selection in MLLMs, laying the groundwork for deeper investigations into visual representation learning for MLLMs.

摘要

多模态大语言模型(MLLMs)在各类任务中展现出卓越性能,其视觉编码器通常采用具有强大图文对齐能力的CLIP-ViT。尽管已有研究表明CLIP-ViT不同层级捕获的信息类型存在差异:浅层侧重精细视觉细节,深层更贴合文本语义,但多数MLLMs仍基于经验启发式而非系统分析来选择视觉特征。本研究提出分层表征相似性方法,将行为相似的CLIP-ViT层级归类为浅层、中层、深层,并评估其对MLLM性能的影响。在此基础上,我们大规模重新审视MLLMs中的视觉层级选择问题,训练了参数规模从14亿到70亿不等的LLaVA风格模型。通过10个数据集和4类任务的广泛实验发现:(1)深层特征对OCR任务至关重要;(2)在涉及计数、定位和物体定位的推理任务中,浅层与中层表现显著优于深层;(3)跨浅、中、深层的轻量级特征融合方案始终优于专用融合基线和单层选择,在10个数据集中的9个实现性能提升。本研究首次对MLLMs视觉层级选择进行了有原则的系统研究,为深入探索MLLMs视觉表征学习奠定基础。


Generative AI in Financial Institution: A Global Survey of Opportunities, Threats, and Regulation

Abstract

arXiv:2504.21574v1 Announce Type: cross Abstract: Generative Artificial Intelligence (GenAI) is rapidly reshaping the global financial landscape, offering unprecedented opportunities to enhance customer engagement, automate complex workflows, and extract actionable insights from vast financial data. This survey provides an overview of GenAI adoption across the financial ecosystem, examining how banks, insurers, asset managers, and fintech startups worldwide are integrating large language models and other generative tools into their operations. From AI-powered virtual assistants and personalized financial advisory to fraud detection and compliance automation, GenAI is driving innovation across functions. However, this transformation comes with significant cybersecurity and ethical risks. We discuss emerging threats such as AI-generated phishing, deepfake-enabled fraud, and adversarial attacks on AI systems, as well as concerns around bias, opacity, and data misuse. The evolving global regulatory landscape is explored in depth, including initiatives by major financial regulators and international efforts to develop risk-based AI governance. Finally, we propose best practices for secure and responsible adoption - including explainability techniques, adversarial testing, auditability, and human oversight. Drawing from academic literature, industry case studies, and policy frameworks, this chapter offers a perspective on how the financial sector can harness GenAI's transformative potential while navigating the complex risks it introduces.

摘要

生成式人工智能(GenAI)正在快速重塑全球金融格局,为提升客户参与度、自动化复杂工作流程以及从海量金融数据中提取可执行洞察提供了前所未有的机遇。本综述系统梳理了GenAI在金融生态系统中的应用现状,考察全球范围内银行、保险公司、资产管理公司和金融科技初创企业如何将大语言模型及其他生成式工具整合至业务运营中。从AI驱动的虚拟助手与个性化财务咨询,到欺诈检测与合规自动化,GenAI正在推动各职能领域的创新。然而这种转型伴随着重大的网络安全与伦理风险。我们探讨了AI生成钓鱼攻击、深度伪造欺诈及针对AI系统的对抗性攻击等新兴威胁,以及算法偏见、系统不透明性和数据滥用等问题。研究深入分析了全球监管态势的演变,包括主要金融监管机构的政策举措和国际社会建立风险导向型AI治理框架的努力。最后,我们提出了安全负责的应用实践建议——涵盖可解释性技术、对抗性测试、可审计性及人工监督机制。基于学术文献、行业案例和政策框架,本章为金融业如何在把握GenAI变革潜力的同时应对其带来的复杂风险提供了前瞻性视角。


DNB-AI-Project at SemEval-2025 Task 5: An LLM-Ensemble Approach for Automated Subject Indexing

Abstract

arXiv:2504.21589v1 Announce Type: cross Abstract: This paper presents our system developed for the SemEval-2025 Task 5: LLMs4Subjects: LLM-based Automated Subject Tagging for a National Technical Library's Open-Access Catalog. Our system relies on prompting a selection of LLMs with varying examples of intellectually annotated records and asking the LLMs to similarly suggest keywords for new records. This few-shot prompting technique is combined with a series of post-processing steps that map the generated keywords to the target vocabulary, aggregate the resulting subject terms to an ensemble vote and, finally, rank them as to their relevance to the record. Our system is fourth in the quantitative ranking in the all-subjects track, but achieves the best result in the qualitative ranking conducted by subject indexing experts.

摘要

本文介绍了为SemEval-2025任务5开发的系统:LLMs4Subjects——基于LLM的国家技术图书馆开放获取目录自动化主题标引系统。该系统通过向选定的多个大语言模型提供经过人工标注的记录示例,要求模型为新记录推荐类似关键词。这种少样本提示技术结合了一系列后处理步骤:将生成关键词映射至目标词表、通过集成投票汇总主题词项、最后根据其与记录的相关性进行排序。在全部主题赛道的量化排名中,本系统位列第四,但在由主题标引专家进行的质性评估中获得了最佳成绩。
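The post-processing chain described above (map generated keywords to the target vocabulary, aggregate them into an ensemble vote, rank the result) can be illustrated with a small sketch. The mapping here is a plain normalized string match and the ranking is by raw vote count; both are simplifying assumptions, as are the function names and the toy vocabulary, and the system's actual mapping and relevance ranking are not specified in the abstract.

```python
from collections import Counter

def normalize(term):
    """Lowercase and collapse whitespace so near-identical keywords match."""
    return " ".join(term.lower().split())

def ensemble_rank(suggestions_per_model, vocabulary):
    """Map free-text keywords to vocabulary terms and rank them by vote count."""
    lookup = {normalize(v): v for v in vocabulary}
    votes = Counter()
    for suggestions in suggestions_per_model:
        # Each model votes at most once per vocabulary term;
        # keywords with no vocabulary match are dropped.
        mapped = {lookup[normalize(s)] for s in suggestions if normalize(s) in lookup}
        votes.update(mapped)
    # Most votes first, then alphabetical for a stable order.
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical vocabulary and model outputs.
vocabulary = ["Machine learning", "Library science", "Robotics"]
suggestions_per_model = [
    ["machine  learning", "Robotics", "deep nets"],
    ["Machine Learning", "library science"],
    ["MACHINE LEARNING", "robotics"],
]
ranking = ensemble_rank(suggestions_per_model, vocabulary)
print(ranking)
```

Counting one vote per model per term keeps a single verbose model from dominating the ensemble.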


Leveraging Pre-trained Large Language Models with Refined Prompting for Online Task and Motion Planning

Abstract

arXiv:2504.21596v1 Announce Type: cross Abstract: With the rapid advancement of artificial intelligence, there is an increasing demand for intelligent robots capable of assisting humans in daily tasks and performing complex operations. Such robots not only require task planning capabilities but must also execute tasks with stability and robustness. In this paper, we present a closed-loop task planning and acting system, LLM-PAS, which is assisted by a pre-trained Large Language Model (LLM). While LLM-PAS plans long-horizon tasks in a manner similar to traditional task and motion planners, it also emphasizes the execution phase of the task. By transferring part of the constraint-checking process from the planning phase to the execution phase, LLM-PAS enables exploration of the constraint space and delivers more accurate feedback on environmental anomalies during execution. The reasoning capabilities of the LLM allow it to handle anomalies that cannot be addressed by the robust executor. To further enhance the system's ability to assist the planner during replanning, we propose the First Look Prompting (FLP) method, which induces LLM to generate effective PDDL goals. Through comparative prompting experiments and systematic experiments, we demonstrate the effectiveness and robustness of LLM-PAS in handling anomalous conditions during task execution.

摘要

随着人工智能技术的快速发展,人们对能够协助人类完成日常任务和执行复杂操作的智能机器人需求日益增长。这类机器人不仅需要具备任务规划能力,还必须以稳定和鲁棒的方式执行任务。本文提出了一种由预训练大语言模型(LLM)辅助的闭环任务规划与执行系统LLM-PAS。该系统在采用类似传统任务与运动规划器的方式进行长周期任务规划的同时,更注重任务的执行阶段。通过将部分约束检查过程从规划阶段转移到执行阶段,LLM-PAS能够探索约束空间,并在执行过程中对环境异常提供更精确的反馈。大语言模型的推理能力使其能够处理鲁棒执行器无法应对的异常情况。为进一步增强系统在重新规划时辅助规划器的能力,我们提出了"初览提示"(FLP)方法,该方法可引导LLM生成有效的PDDL目标。通过对比提示实验和系统性实验,我们验证了LLM-PAS在处理任务执行过程中异常情况的有效性和鲁棒性。


RDF-Based Structured Quality Assessment Representation of Multilingual LLM Evaluations

Abstract

arXiv:2504.21605v1 Announce Type: cross Abstract: Large Language Models (LLMs) increasingly serve as knowledge interfaces, yet systematically assessing their reliability with conflicting information remains difficult. We propose an RDF-based framework to assess multilingual LLM quality, focusing on knowledge conflicts. Our approach captures model responses across four distinct context conditions (complete, incomplete, conflicting, and no-context information) in German and English. This structured representation enables the comprehensive analysis of knowledge leakage (where models favor training data over provided context), error detection, and multilingual consistency. We demonstrate the framework through a fire safety domain experiment, revealing critical patterns in context prioritization and language-specific performance, and demonstrating that our vocabulary was sufficient to express every assessment facet encountered in the 28-question study.

摘要

大型语言模型(LLMs)日益成为知识交互界面,但系统评估其在冲突信息下的可靠性仍具挑战性。我们提出一个基于RDF的框架来评估多语言LLM质量,重点关注知识冲突场景。该方法捕获模型在德语和英语四种不同上下文条件(完整、不完整、冲突及无上下文信息)下的响应,通过结构化表征实现知识泄漏(模型倾向训练数据而非提供上下文)、错误检测及多语言一致性的综合分析。我们以消防安全领域实验验证该框架,揭示了上下文优先级和语言特异性表现的关键模式,并证明本研究的28问评估中,所构建词汇足以表达所有遇到的评估维度。


Sadeed: Advancing Arabic Diacritization Through Small Language Model

Abstract

arXiv:2504.21635v1 Announce Type: cross Abstract: Arabic text diacritization remains a persistent challenge in natural language processing due to the language's morphological richness. In this paper, we introduce Sadeed, a novel approach based on a fine-tuned decoder-only language model adapted from Kuwain 1.5B (Hennara et al., 2025), a compact model originally trained on diverse Arabic corpora. Sadeed is fine-tuned on carefully curated, high-quality diacritized datasets, constructed through a rigorous data-cleaning and normalization pipeline. Despite utilizing modest computational resources, Sadeed achieves competitive results compared to proprietary large language models and outperforms traditional models trained on similar domains. Additionally, we highlight key limitations in current benchmarking practices for Arabic diacritization. To address these issues, we introduce SadeedDiac-25, a new benchmark designed to enable fairer and more comprehensive evaluation across diverse text genres and complexity levels. Together, Sadeed and SadeedDiac-25 provide a robust foundation for advancing Arabic NLP applications, including machine translation, text-to-speech, and language learning tools.

摘要

阿拉伯语文本变音符号标注(diacritization)由于该语言形态丰富的特性,始终是自然语言处理领域的一项持续挑战。本文提出Sadeed,一种基于微调的仅解码器语言模型的新方法,该模型改编自Kuwain 1.5B(Hennara等,2025),后者是最初在多样化阿拉伯语语料库上训练的紧凑模型。Sadeed在经严格数据清洗与标准化流程构建的高质量变音标注数据集上进行微调,尽管仅使用适度的计算资源,其性能仍可媲美专有大型语言模型,并超越在相近领域训练的传统模型。此外,我们重点指出了当前阿拉伯语变音标注基准测试实践中的关键局限。为解决这些问题,我们推出SadeedDiac-25新基准,旨在实现对不同文本类型与复杂度层级更公平、更全面的评估。Sadeed与SadeedDiac-25共同为推进阿拉伯语自然语言处理应用(包括机器翻译、文本转语音及语言学习工具)奠定了坚实基础。


XBreaking: Explainable Artificial Intelligence for Jailbreaking LLMs

Abstract

arXiv:2504.21700v1 Announce Type: cross Abstract: Large Language Models are fundamental actors in the modern IT landscape dominated by AI solutions. However, security threats associated with them might prevent their reliable adoption in critical application scenarios such as government organizations and medical institutions. For this reason, commercial LLMs typically undergo a sophisticated censoring mechanism to eliminate any harmful output they could possibly produce. In response to this, LLM Jailbreaking is a significant threat to such protections, and many previous approaches have already demonstrated its effectiveness across diverse domains. Existing jailbreak proposals mostly adopt a generate-and-test strategy to craft malicious input. To improve the comprehension of censoring mechanisms and design a targeted jailbreak attack, we propose an Explainable-AI solution that comparatively analyzes the behavior of censored and uncensored models to derive unique exploitable alignment patterns. Then, we propose XBreaking, a novel jailbreak attack that exploits these unique patterns to break the security constraints of LLMs by targeted noise injection. Our thorough experimental campaign returns important insights about the censoring mechanisms and demonstrates the effectiveness and performance of our attack.

摘要

大型语言模型已成为人工智能解决方案主导的现代信息技术领域的基础性存在。然而与之相关的安全威胁可能阻碍其在政府机构和医疗机构等关键应用场景中的可靠采用。为此,商业级大型语言模型通常需经过复杂的审查机制处理,以消除其可能产生的任何有害输出。针对这种防护措施,语言模型越狱技术构成了重大威胁,先前诸多研究方法已证实其在多个领域的有效性。现有越狱方案主要采用生成-测试策略来构造恶意输入。为深化对审查机制的理解并设计针对性越狱攻击,我们提出一种可解释人工智能解决方案,通过对比分析审查版与未审查模型的行为特征,提取可供利用的独特对齐模式。基于此,我们提出XBreaking——一种新型越狱攻击技术,该技术通过定向噪声注入来利用这些独特模式突破大型语言模型的安全约束。我们全面的实验研究揭示了关于审查机制的重要发现,并验证了本攻击方法的有效性与性能表现。


LLM-Empowered Embodied Agent for Memory-Augmented Task Planning in Household Robotics

Abstract

arXiv:2504.21716v1 Announce Type: cross Abstract: We present an embodied robotic system with an LLM-driven agent-orchestration architecture for autonomous household object management. The system integrates memory-augmented task planning, enabling robots to execute high-level user commands while tracking past actions. It employs three specialized agents: a routing agent, a task planning agent, and a knowledge base agent, each powered by task-specific LLMs. By leveraging in-context learning, our system avoids the need for explicit model training. RAG enables the system to retrieve context from past interactions, enhancing long-term object tracking. A combination of Grounded SAM and LLaMa3.2-Vision provides robust object detection, facilitating semantic scene understanding for task planning. Evaluation across three household scenarios demonstrates high task planning accuracy and an improvement in memory recall due to RAG. Specifically, Qwen2.5 yields best performance for specialized agents, while LLaMA3.1 excels in routing tasks. The source code is available at: https://github.com/marc1198/chat-hsr.

摘要

我们提出了一种具身机器人系统,采用基于大语言模型(LLM)的智能体编排架构,用于自主管理家居物品。该系统集成了记忆增强的任务规划功能,使机器人能够执行高级用户指令并追踪历史操作。系统部署了三个专用智能体:路由智能体、任务规划智能体和知识库智能体,每个智能体均由任务专用LLM驱动。通过上下文学习,本系统避免了显式模型训练的需求。检索增强生成(RAG)技术使系统能够从过往交互中检索上下文,从而增强长期物品追踪能力。结合Grounded SAM和LLaMa3.2-Vision的混合方案提供了鲁棒的物体检测功能,为任务规划实现语义场景理解。在三种家居场景中的评估表明,该系统具有较高的任务规划准确率,且RAG技术提升了记忆召回性能。具体而言,Qwen2.5在专用智能体任务中表现最优,而LLaMA3.1在路由任务中性能突出。源代码见 https://github.com/marc1198/chat-hsr 。


MAC-Tuning: LLM Multi-Compositional Problem Reasoning with Enhanced Knowledge Boundary Awareness

Abstract

arXiv:2504.21773v1 Announce Type: cross Abstract: With the widespread application of large language models (LLMs), the issue of generating non-existing facts, known as hallucination, has garnered increasing attention. Previous research in enhancing LLM confidence estimation mainly focuses on the single problem setting. However, LLM awareness of its internal parameterized knowledge boundary under the more challenging multi-problem setting, which requires answering multiple problems accurately simultaneously, remains underexplored. To bridge this gap, we introduce a novel method, Multiple Answers and Confidence Stepwise Tuning (MAC-Tuning), that separates the learning of answer prediction and confidence estimation during fine-tuning on instruction data. Extensive experiments demonstrate that our method outperforms baselines by up to 25% in average precision.

摘要

随着大语言模型(LLMs)的广泛应用,其生成虚构事实(即幻觉)的问题日益受到关注。现有研究在提升LLM置信度估计方面主要集中于单一问题场景,而对更具挑战性的多问题场景下(需同时准确回答多个问题)模型对其内部参数化知识边界的认知仍缺乏深入探索。为此,我们提出了一种新方法——多答案与置信度分步调优(MAC-Tuning),该方法在指令数据微调过程中将答案预测与置信度估计的学习过程分离。大量实验表明,我们的方法在平均精确率上最高可超越基线方法25%。


WebThinker: Empowering Large Reasoning Models with Deep Research Capability

Abstract

arXiv:2504.21776v1 Announce Type: cross Abstract: Large reasoning models (LRMs), such as OpenAI-o1 and DeepSeek-R1, demonstrate impressive long-horizon reasoning capabilities. However, their reliance on static internal knowledge limits their performance on complex, knowledge-intensive tasks and hinders their ability to produce comprehensive research reports requiring synthesis of diverse web information. To address this, we propose WebThinker, a deep research agent that empowers LRMs to autonomously search the web, navigate web pages, and draft research reports during the reasoning process. WebThinker integrates a Deep Web Explorer module, enabling LRMs to dynamically search, navigate, and extract information from the web when encountering knowledge gaps. It also employs an Autonomous Think-Search-and-Draft strategy, allowing the model to seamlessly interleave reasoning, information gathering, and report writing in real time. To further enhance research tool utilization, we introduce an RL-based training strategy via iterative online Direct Preference Optimization (DPO). Extensive experiments on complex reasoning benchmarks (GPQA, GAIA, WebWalkerQA, HLE) and scientific report generation tasks (Glaive) demonstrate that WebThinker significantly outperforms existing methods and strong proprietary systems. Our approach enhances LRM reliability and applicability in complex scenarios, paving the way for more capable and versatile deep research systems. The code is available at https://github.com/RUC-NLPIR/WebThinker.

摘要

大型推理模型(LRMs),如OpenAI-o1和DeepSeek-R1,展现出卓越的长程推理能力。然而,其对静态内部知识的依赖限制了其在复杂、知识密集型任务上的表现,并阻碍了其生成需要综合多样化网络信息的全面研究报告的能力。为解决这一问题,我们提出了WebThinker,一种深度研究智能体,赋予LRMs在推理过程中自主搜索网络、浏览网页和起草研究报告的能力。WebThinker集成了一个深度网络探索器模块,使LRMs在遇到知识缺口时能够动态搜索、浏览并从网络中提取信息。它还采用了一种自主思考-搜索-起草策略,允许模型实时无缝地交替进行推理、信息收集和报告撰写。为进一步提升研究工具的利用率,我们通过迭代在线直接偏好优化(DPO)引入了一种基于强化学习的训练策略。在复杂推理基准测试(GPQA、GAIA、WebWalkerQA、HLE)和科学报告生成任务(Glaive)上的大量实验表明,WebThinker显著优于现有方法和强大的专有系统。我们的方法增强了LRM在复杂场景中的可靠性和适用性,为更强大、更通用的深度研究系统铺平了道路。代码可在https://github.com/RUC-NLPIR/WebThinker获取。


SWE-smith: Scaling Data for Software Engineering Agents

Abstract

arXiv:2504.21798v1 Announce Type: cross Abstract: Despite recent progress in Language Models (LMs) for software engineering, collecting training data remains a significant pain point. Existing datasets are small, with at most 1,000s of training instances from 11 or fewer GitHub repositories. The procedures to curate such datasets are often complex, necessitating hundreds of hours of human labor; companion execution environments also take up several terabytes of storage, severely limiting their scalability and usability. To address this pain point, we introduce SWE-smith, a novel pipeline for generating software engineering training data at scale. Given any Python codebase, SWE-smith constructs a corresponding execution environment, then automatically synthesizes 100s to 1,000s of task instances that break existing test(s) in the codebase. Using SWE-smith, we create a dataset of 50k instances sourced from 128 GitHub repositories, an order of magnitude larger than all previous works. We train SWE-agent-LM-32B, achieving 40.2% Pass@1 resolve rate on the SWE-bench Verified benchmark, state of the art among open source models. We open source SWE-smith (collection procedure, task instances, trajectories, models) to lower the barrier of entry for research in LM systems for automated software engineering. All assets available at https://swesmith.com.

摘要

尽管语言模型在软件工程领域取得进展,数据收集仍是主要痛点。现有数据集规模有限,最多仅包含来自11个以下GitHub仓库的数千条训练实例。这些数据集的构建流程通常复杂,需耗费数百小时人工;配套执行环境更占用数TB存储空间,严重制约其扩展性与可用性。为解决该问题,我们提出SWE-smith——一种规模化生成软件工程训练数据的新方法。给定任意Python代码库,SWE-smith能构建对应执行环境,并自动合成数百至数千个破坏现有测试用例的任务实例。通过该方法,我们创建了包含128个GitHub仓库5万实例的数据集,规模超先前工作一个数量级。据此训练的SWE-agent-LM-32B模型在SWE-bench Verified基准测试中达到40.2%的Pass@1解决率,成为开源模型中的最优结果。我们开源SWE-smith全套资源(采集流程、任务实例、轨迹记录及模型),以降低自动化软件工程语言模型系统的研究门槛。所有资源详见https://swesmith.com。


DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition

Abstract

arXiv:2504.21801v1 Announce Type: cross Abstract: We introduce DeepSeek-Prover-V2, an open-source large language model designed for formal theorem proving in Lean 4, with initialization data collected through a recursive theorem proving pipeline powered by DeepSeek-V3. The cold-start training procedure begins by prompting DeepSeek-V3 to decompose complex problems into a series of subgoals. The proofs of resolved subgoals are synthesized into a chain-of-thought process, combined with DeepSeek-V3's step-by-step reasoning, to create an initial cold start for reinforcement learning. This process enables us to integrate both informal and formal mathematical reasoning into a unified model. The resulting model, DeepSeek-Prover-V2-671B, achieves state-of-the-art performance in neural theorem proving, reaching 88.9% pass ratio on the MiniF2F-test and solving 49 out of 658 problems from PutnamBench. In addition to standard benchmarks, we introduce ProverBench, a collection of 325 formalized problems, to enrich our evaluation, including 15 selected problems from the recent AIME competitions (years 24-25). Further evaluation on these 15 AIME problems shows that the model successfully solves 6 of them. In comparison, DeepSeek-V3 solves 8 of these problems using majority voting, highlighting that the gap between formal and informal mathematical reasoning in large language models is substantially narrowing.

摘要

我们推出DeepSeek-Prover-V2,这是一个专为Lean 4形式化定理证明设计的开源大语言模型,其初始化数据通过基于DeepSeek-V3的递归定理证明流程采集。冷启动训练过程首先引导DeepSeek-V3将复杂问题分解为系列子目标,并将已解决子目标的证明合成为思维链过程,再与DeepSeek-V3的逐步推理相结合,形成强化学习的初始冷启动。该方法使我们能够将非形式化与形式化数学推理统一整合至单一模型中。最终模型DeepSeek-Prover-V2-671B在神经定理证明领域取得最先进性能,于MiniF2F-test测试集达到88.9%通过率,并在PutnamBench的658道问题中成功求解49道。除标准测试基准外,我们提出包含325个形式化问题的ProverBench评估集以丰富评测维度,其中包括从近期AIME竞赛(2024至2025年)精选的15道题目。针对这15道AIME问题的进一步评估显示,模型成功求解其中6道。作为对比,DeepSeek-V3通过多数表决机制解决其中8道,这表明大语言模型中形式化与非形式化数学推理能力的差距正在显著缩小。


Early Exit and Multi Stage Knowledge Distillation in VLMs for Video Summarization

Abstract

arXiv:2504.21831v1 Announce Type: cross Abstract: We introduce DEEVISum (Distilled Early Exit Vision language model for Summarization), a lightweight, efficient, and scalable vision language model designed for segment-wise video summarization. Leveraging multimodal prompts that combine textual and audio-derived signals, DEEVISum incorporates Multi Stage Knowledge Distillation (MSKD) and Early Exit (EE) to strike a balance between performance and efficiency. MSKD offers a 1.33% absolute F1 improvement over baseline distillation (0.5%), while EE reduces inference time by approximately 21% with a 1.3 point drop in F1. Evaluated on the TVSum dataset, our best model PaLI Gemma2 3B + MSKD achieves an F1 score of 61.1, rivaling the performance of significantly larger models, all while maintaining a lower computational footprint. We publicly release our code and processed dataset to support further research.

摘要

我们提出DEEVISum(基于蒸馏早退机制的视觉语言视频摘要模型),这是一种轻量化、高效且可扩展的视觉语言模型,专为分段式视频摘要任务设计。通过融合文本与音频信号的多模态提示,该模型采用多阶段知识蒸馏(MSKD)和早退机制(EE)来实现性能与效率的平衡。实验表明,MSKD相比基线蒸馏方法(0.5%)带来1.33%的绝对F1值提升,而EE机制在F1值仅下降1.3分的同时使推理时间减少约21%。在TVSum数据集上的评估显示,我们的最佳模型PaLI Gemma2 3B + MSKD取得了61.1的F1分数,其性能可与规模显著更大的模型相媲美,同时始终保持较低的计算开销。我们公开了代码和经处理的数据集以支持后续研究。


TRUST: An LLM-Based Dialogue System for Trauma Understanding and Structured Assessments

Abstract

arXiv:2504.21851v1 Announce Type: cross Abstract: Objectives: While Large Language Models (LLMs) have been widely used to assist clinicians and support patients, no existing work has explored dialogue systems for standard diagnostic interviews and assessments. This study aims to bridge the gap in mental healthcare accessibility by developing an LLM-powered dialogue system that replicates clinician behavior. Materials and Methods: We introduce TRUST, a framework of cooperative LLM modules capable of conducting formal diagnostic interviews and assessments for Post-Traumatic Stress Disorder (PTSD). To guide the generation of appropriate clinical responses, we propose a Dialogue Acts schema specifically designed for clinical interviews. Additionally, we develop a patient simulation approach based on real-life interview transcripts to replace time-consuming and costly manual testing by clinicians. Results: A comprehensive set of evaluation metrics is designed to assess the dialogue system from both the agent and patient simulation perspectives. Expert evaluations by conversation and clinical specialists show that TRUST performs comparably to real-life clinical interviews. Discussion: Our system performs at the level of average clinicians, with room for future enhancements in communication styles and response appropriateness. Conclusions: Our TRUST framework shows its potential to facilitate mental healthcare availability.

摘要

目的:尽管大语言模型(LLMs)已广泛应用于辅助临床医生和支持患者,但现有研究尚未探索用于标准化诊断性访谈与评估的对话系统。本研究旨在通过开发模拟临床医生行为的LLM驱动对话系统,弥合心理健康服务可及性缺口。材料与方法:我们提出TRUST框架——一个由协作LLM模块组成的系统,能够执行创伤后应激障碍(PTSD)的规范化诊断访谈与评估。为引导生成恰当的临床应答,我们专门设计了适用于临床访谈的对话行为模式。此外,基于真实访谈记录开发了患者模拟方法,以替代耗时且成本高昂的临床医生人工测试。结果:设计了一套综合评估指标,分别从代理系统和患者模拟视角评估对话系统。会话分析与临床专家的评估表明,TRUST的表现与真实临床访谈相当。讨论:本系统达到普通临床医生水平,在沟通风格与应答适当性方面仍有提升空间。结论:TRUST框架展现出促进心理健康服务可及性的潜力。


MedPix 2.0: A Comprehensive Multimodal Biomedical Data set for Advanced AI Applications

Abstract

arXiv:2407.02994v4 Announce Type: replace Abstract: The increasing interest in developing Artificial Intelligence applications in the medical domain suffers from the lack of high-quality data sets, mainly due to privacy-related issues. Moreover, the recent rise of Large Multimodal Models (LMMs) leads to a need for multimodal medical data sets, where clinical reports and findings are attached to the corresponding CT or MR scans. This paper illustrates the entire workflow for building the data set MedPix 2.0. Starting from the well-known multimodal data set MedPix, mainly used by physicians, nurses and healthcare students for Continuing Medical Education purposes, a semi-automatic pipeline was developed to extract visual and textual data, followed by a manual curation procedure in which noisy samples were removed, thus creating a MongoDB database. Along with the data set, we developed a GUI for navigating the MongoDB instance efficiently and obtaining the raw data, which can be easily used for training and/or fine-tuning LMMs. To reinforce this point, we also propose a CLIP-based model trained on MedPix 2.0 for scanning modality and location classification tasks. MedPix 2.0 is available on GitHub.

摘要

随着人工智能在医疗领域应用开发的日益增长,高质量数据集的匮乏成为主要障碍,这主要源于隐私相关问题。此外,大型多模态模型(LMM)的兴起导致对多模态医疗数据集的需求增加,这类数据集需要将临床报告和检查结果与对应的CT或MR扫描图像关联。本文详细阐述了构建MedPix 2.0数据集的全流程。该工作始于被医师、护士和医学生广泛用于继续医学教育的知名多模态数据集MedPix,我们开发了半自动化流程来提取视觉与文本数据,随后通过人工校验程序去除噪声样本,最终构建了MongoDB数据库。除数据集外,我们还开发了用于高效浏览MongoDB实例的图形界面,并支持获取可直接用于LMM训练/微调的原始数据。为验证其实用性,我们提出基于MedPix 2.0训练的CLIP模型,用于扫描模态和部位分类任务。MedPix 2.0已在GitHub平台开源。


Retrieval, Reasoning, Re-ranking: A Context-Enriched Framework for Knowledge Graph Completion

Abstract

arXiv:2411.08165v2 Announce Type: replace Abstract: The Knowledge Graph Completion (KGC) task aims to infer the missing entity from an incomplete triple. Existing embedding-based methods rely solely on triples in the KG, which is vulnerable to specious relation patterns and long-tail entities. On the other hand, text-based methods struggle with the semantic gap between KG triples and natural language. Apart from triples, entity contexts (e.g., labels, descriptions, aliases) also play a significant role in augmenting KGs. To address these limitations, we propose KGR3, a context-enriched framework for KGC. KGR3 is composed of three modules. Firstly, the Retrieval module gathers supporting triples from the KG, collects plausible candidate answers from a base embedding model, and retrieves context for each related entity. Then, the Reasoning module employs a large language model to generate potential answers for each query triple. Finally, the Re-ranking module combines candidate answers from the two modules mentioned above, and fine-tunes an LLM to provide the best answer. Extensive experiments on widely used datasets demonstrate that KGR3 consistently improves various KGC methods. Specifically, the best variant of KGR3 achieves absolute Hits@1 improvements of 12.3% and 5.6% on the FB15k237 and WN18RR datasets.

摘要

知识图谱补全(KGC)任务旨在从不完整三元组中推断缺失实体。现有基于嵌入的方法仅依赖知识图谱中的三元组,容易受到虚假关系模式与长尾实体的影响。基于文本的方法则面临知识图谱三元组与自然语言间语义鸿沟的挑战。除三元组外,实体上下文(如标签、描述、别名)对增强知识图谱同样具有重要作用。为克服这些局限性,我们提出KGR3这一面向KGC任务的上下文增强框架。该框架包含三个核心模块:检索模块从知识图谱中收集支持性三元组,通过基础嵌入模型获取候选答案,并为相关实体检索上下文;推理模块利用大语言模型为每个查询三元组生成潜在答案;重排序模块整合上述模块的候选答案,并通过微调大语言模型输出最优解。在多个基准数据集上的实验表明,KGR3能持续提升各类KGC方法的性能。具体而言,KGR3的最佳变体在FB15k237和WN18RR数据集上分别实现了12.3%和5.6%的Hits@1绝对值提升。
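
The Hits@1 figures quoted above are easier to read with the metric written out. The following is a minimal sketch of the standard Hits@k definition (the fraction of queries whose gold entity appears among the top-k ranked candidates); it is not code from KGR3, and the function name and toy data are illustrative assumptions. An "absolute" improvement of 12.3% corresponds to 12.3 percentage points on this score.

```python
from typing import Sequence


def hits_at_k(ranked_candidates: Sequence[Sequence[str]],
              gold: Sequence[str],
              k: int = 1) -> float:
    """Fraction of queries whose gold entity is within the top-k candidates."""
    if len(ranked_candidates) != len(gold):
        raise ValueError("need exactly one candidate ranking per gold answer")
    if not gold:
        return 0.0
    hits = sum(1 for cands, g in zip(ranked_candidates, gold) if g in cands[:k])
    return hits / len(gold)


# Toy data (hypothetical): three queries, each with a ranked candidate list.
rankings = [["Paris", "Lyon"], ["Berlin", "Munich"], ["Rome", "Madrid"]]
answers = ["Paris", "Munich", "Madrid"]

hits1 = hits_at_k(rankings, answers, k=1)  # only the first query is a hit
hits2 = hits_at_k(rankings, answers, k=2)  # every gold entity is in the top 2
print(f"Hits@1={hits1:.3f}  Hits@2={hits2:.3f}")
```

A re-ranking stage such as the one described in the abstract only changes the order of each candidate list, so its effect shows up directly as a change in this score.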


APEX: An Extensible and Dynamism-Aware Simulator for Automated Parallel Execution in LLM Serving

Abstract

arXiv:2411.17651v2 Announce Type: replace Abstract: Efficiently serving Large Language Models (LLMs) requires selecting an optimal parallel execution plan, balancing computation, memory, and communication overhead. However, determining the best strategy is challenging due to varying parallelism techniques (data, pipeline, tensor) and workload characteristics (e.g., compute-intensive tasks with long prompts vs. memory-intensive tasks with long generation). We propose APEX, an LLM serving system simulator that efficiently identifies optimal parallel execution plans by considering key factors of LLM serving systems, such as memory usage, batching behavior, etc. APEX performs dynamism-aware simulation to model iteration-level batching, and leverages LLMs' repetitive structure to reduce design space, scaling efficiently to trillion-scale models. APEX abstracts the key components of LLM serving systems, including the model, batching module, quantization formats, and device clusters, enabling the simulator to be general and extensible. Simulating on a CPU, APEX evaluates execution plans for various device clusters, covering diverse LLMs and workloads. APEX finds plans up to 3.37x faster than heuristics, and also plans that reduce energy consumption by up to 45% compared to latency-optimal plans. APEX performs comprehensive evaluations, reporting key system metrics like time per output token and time to first token, which can help service providers meet SLOs. APEX identifies an optimal plan within 15 minutes on a CPU, making it 71x faster and 1234x more cost-effective than cloud-based GPU deployment. APEX can be accessed at https://github.com/microsoft/apex_plus

摘要

高效服务大型语言模型(LLM)需要选择最优并行执行方案,以平衡计算、内存和通信开销。然而,由于并行技术(数据、流水线、张量)和工作负载特征(如长提示的计算密集型任务与长生成的内存密集型任务)的多样性,确定最佳策略具有挑战性。我们提出APEX——一个通过考虑LLM服务系统的内存使用、批处理行为等关键因素来高效识别最优并行执行方案的模拟系统。APEX执行动态感知模拟以建模迭代级批处理,并利用LLM的重复结构缩减设计空间,可高效扩展至万亿规模模型。该系统抽象了LLM服务系统的核心组件,包括模型、批处理模块、量化格式和设备集群,使模拟器具备通用性和可扩展性。在CPU上运行时,APEX能针对不同设备集群评估执行方案,覆盖多样化的LLM和工作负载。实验表明:APEX发现的方案比启发式方法快3.37倍,相比延迟最优方案可降低45%能耗。该系统提供包括单输出令牌时间和首令牌时间在内的关键指标评估,有助于服务商满足SLO要求。APEX在CPU上15分钟内即可确定最优方案,其速度比云端GPU部署快71倍,成本效益高1234倍。项目地址:https://github.com/microsoft/apex_plus
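
The two serving metrics named in the abstract, time to first token and time per output token, can be computed from per-request token timestamps. The sketch below uses one common definition of each metric; the `RequestTrace` type and its field names are hypothetical and are not taken from the APEX code base.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Hypothetical record of one served request (times in seconds)."""
    arrival_s: float
    token_times_s: List[float]  # emission time of each output token, in order


def time_to_first_token(trace: RequestTrace) -> float:
    """Delay between request arrival and the first output token."""
    if not trace.token_times_s:
        raise ValueError("request produced no output tokens")
    return trace.token_times_s[0] - trace.arrival_s


def time_per_output_token(trace: RequestTrace) -> float:
    """Mean gap between consecutive output tokens after the first one."""
    n = len(trace.token_times_s)
    if n < 2:
        return 0.0
    return (trace.token_times_s[-1] - trace.token_times_s[0]) / (n - 1)


# Toy trace: the prompt is processed for 0.5 s, then one token every 0.25 s.
trace = RequestTrace(arrival_s=0.0, token_times_s=[0.5, 0.75, 1.0, 1.25, 1.5])
ttft = time_to_first_token(trace)
tpot = time_per_output_token(trace)
print(f"TTFT={ttft:.2f}s  TPOT={tpot:.2f}s")
```

Long prompts mainly inflate the first metric and long generations mainly accumulate the second, which matches the compute-intensive versus memory-intensive workload split mentioned in the abstract.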


Mastering Board Games by External and Internal Planning with Language Models

Abstract

arXiv:2412.12119v2 Announce Type: replace Abstract: Advancing planning and reasoning capabilities of Large Language Models (LLMs) is one of the key prerequisites towards unlocking their potential for performing reliably in complex and impactful domains. In this paper, we aim to demonstrate this across board games (Chess, Fischer Random / Chess960, Connect Four, and Hex), and we show that search-based planning can yield significant improvements in LLM game-playing strength. We introduce, compare and contrast two major approaches: In external search, the model guides Monte Carlo Tree Search (MCTS) rollouts and evaluations without calls to an external game engine, and in internal search, the model is trained to generate in-context a linearized tree of search and a resulting final choice. Both build on a language model pre-trained on relevant domain knowledge, reliably capturing the transition and value functions in the respective environments, with minimal hallucinations. We evaluate our LLM search implementations against game-specific state-of-the-art engines, showcasing substantial improvements in strength over the base model, and reaching Grandmaster-level performance in chess while operating closer to the human search budget. Our proposed approach, combining search with domain knowledge, is not specific to board games, hinting at more general future applications.

摘要

提升大语言模型(LLMs)的规划与推理能力,是释放其在复杂关键领域可靠应用潜力的关键前提。本文通过棋盘游戏(国际象棋、菲舍尔随机象棋/Chess960、四子棋及六边形棋)验证了这一观点,并证明基于搜索的规划能显著增强LLM的游戏对弈水平。我们提出并对比两种主要方法:外部搜索中,模型无需调用外部游戏引擎即可指导蒙特卡洛树搜索(MCTS)的推演与评估;内部搜索中,模型通过训练生成上下文相关的线性化搜索树并输出最终决策。两种方法均基于预训练的语言模型构建,该模型能可靠捕捉对应环境中的状态转移与价值函数,且幻觉现象极少。我们将LLM搜索方案与各游戏领域的最先进引擎对比,结果显示其性能较基础模型有显著提升,在国际象棋中达到特级大师水平,同时搜索预算更接近人类。这种结合领域知识与搜索的方法不仅限于棋盘游戏,为未来更广泛的通用应用提供了可能。


MoEtion: Efficient and Reliable Sparse Checkpointing for Mixture-of-Experts Models at Scale

Abstract

arXiv:2412.15411v2 Announce Type: replace Abstract: As large language models continue to scale, training them requires thousands of GPUs over prolonged durations--making frequent failures an inevitable reality. While checkpointing remains the primary fault-tolerance mechanism, existing methods struggle to efficiently support Mixture-of-Experts (MoE) models. Due to the substantially larger training state of MoE models, traditional checkpointing techniques incur prohibitive overheads, resulting in frequent stalls or prolonged recovery periods that severely degrade training efficiency. We introduce MoEtion, a distributed, in-memory checkpointing system designed explicitly for MoE models. MoEtion builds on three key ideas: (1) sparse checkpointing, which incrementally checkpoints subsets of experts over multiple iterations, significantly reducing snapshot overhead; (2) a sparse-to-dense checkpoint conversion technique that incrementally reconstructs temporally consistent checkpoints from sparse snapshots; and (3) lightweight upstream logging activations and gradients at pipeline-stage boundaries to localize recovery of failed workers without redundant recomputation of unaffected workers. Evaluations across diverse MoE models with up to 64 experts demonstrate that MoEtion reduces checkpointing overhead by up to 4× and recovery overhead by up to 31× compared to state-of-the-art approaches, achieving consistently high Effective Training Time Ratios (ETTR) of up to 0.98, even under frequent failures (MTBF as low as 20 minutes) without compromising synchronous training semantics. Overall, MoEtion offers a practical, scalable, and robust fault-tolerance solution for the next generation of sparsely activated models.

摘要

随着大型语言模型规模持续扩大,其训练过程需要数千个GPU长时间运行——这使得频繁故障成为不可避免的现实。虽然检查点仍是主要的容错机制,但现有方法难以高效支持混合专家(MoE)模型。由于MoE模型的训练状态规模显著更大,传统检查点技术会产生过高开销,导致频繁停顿或漫长恢复周期,严重降低训练效率。

我们提出MoEtion,一个专为MoE模型设计的分布式内存检查点系统。该系统基于三个关键创新:(1)稀疏检查点技术,通过多轮迭代增量式保存专家子集,显著降低快照开销;(2)稀疏-稠密检查点转换技术,从稀疏快照逐步重建时序一致的检查点;(3)在流水线阶段边界轻量级记录上游激活和梯度,实现故障工作节点的局部化恢复,避免未受影响节点的冗余重计算。在包含多达64个专家的多种MoE模型上的实验表明,相比最先进方案,MoEtion将检查点开销降低达4倍,恢复开销降低达31倍,即使在频繁故障(平均故障间隔时间低至20分钟)情况下仍能保持高达0.98的有效训练时间比(ETTR),且不违反同步训练语义。总体而言,MoEtion为下一代稀疏激活模型提供了实用、可扩展且鲁棒的容错解决方案。
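
The first idea in the abstract, checkpointing a different subset of experts at each iteration, can be illustrated with a simple round-robin schedule. This is a sketch under stated assumptions: the round-robin policy and the function name are illustrative choices, not MoEtion's actual scheduling logic, which the abstract does not specify.

```python
from typing import List


def sparse_checkpoint_schedule(num_experts: int,
                               experts_per_iter: int) -> List[List[int]]:
    """Split expert indices into consecutive groups, one group per iteration.

    After len(schedule) iterations every expert has been snapshotted once,
    while each single iteration only pays for `experts_per_iter` experts.
    """
    if num_experts <= 0 or experts_per_iter <= 0:
        raise ValueError("both arguments must be positive")
    return [
        list(range(start, min(start + experts_per_iter, num_experts)))
        for start in range(0, num_experts, experts_per_iter)
    ]


schedule = sparse_checkpoint_schedule(num_experts=8, experts_per_iter=3)
for iteration, experts in enumerate(schedule):
    print(f"iteration {iteration}: snapshot experts {experts}")
```

Because the groups are captured at different iterations, the raw snapshots are not mutually consistent; that gap is what the sparse-to-dense conversion step described in the abstract is meant to close.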


Are Transformers Able to Reason by Connecting Separated Knowledge in Training Data?

Abstract

arXiv:2501.15857v4 Announce Type: replace Abstract: Humans exhibit remarkable compositional reasoning by integrating knowledge from various sources. For example, if someone learns B = f(A) from one source and C = g(B) from another, they can deduce C = g(B) = g(f(A)) even without encountering ABC together, showcasing the generalization ability of human intelligence. In this paper, we introduce a synthetic learning task, "FTCT" (Fragmented at Training, Chained at Testing), to validate the potential of Transformers in replicating this skill and interpret its inner mechanism. In the training phase, data consist of separated knowledge fragments from an overall causal graph. During testing, Transformers must infer complete causal graph traces by integrating these fragments. Our findings demonstrate that few-shot Chain-of-Thought prompting enables Transformers to perform compositional reasoning on FTCT by revealing correct combinations of fragments, even if such combinations were absent in the training data. Furthermore, the emergence of compositional reasoning ability is strongly correlated with the model complexity and training-testing data similarity. We propose, both theoretically and empirically, that Transformers learn an underlying generalizable program from training, enabling effective compositional reasoning during testing.

摘要

人类展现出卓越的组合推理能力,能够整合来自不同来源的知识。例如,当人们从一个来源学习到(B = f(A)),从另一个来源获得(C = g(B))时,即使从未同时接触过(ABC),也能推导出(C=g(B)=g(f(A))),这体现了人类智能的泛化能力。本文提出一个名为"FTCT"(训练阶段碎片化,测试阶段链条化)的合成学习任务,用于验证Transformer模型在复现这种技能方面的潜力并解析其内部机制。在训练阶段,数据由整体因果图中分离的知识片段组成;测试阶段要求Transformer模型通过整合这些片段来推断完整的因果图轨迹。研究发现,少量样本的思维链提示能使Transformer模型在FTCT任务中展现组合推理能力——即使训练数据中从未出现过特定组合,模型仍能正确整合知识片段。进一步研究表明,这种组合推理能力的涌现与模型复杂度及训练-测试数据相似度呈强相关性。我们通过理论分析和实证验证提出:Transformer模型从训练数据中学习到了具有泛化性的底层程序,从而在测试阶段实现了有效的组合推理。


Agentic AI Systems Applied to tasks in Financial Services: Modeling and model risk management crews

Abstract

arXiv:2502.05439v2 Announce Type: replace Abstract: The advent of large language models has ushered in a new era of agentic systems, where artificial intelligence programs exhibit remarkable autonomous decision-making capabilities across diverse domains. This paper explores agentic system workflows in the financial services industry. In particular, we build agentic crews with a human-in-the-loop module that can effectively collaborate to perform complex modeling and model risk management (MRM) tasks. The modeling crew consists of a judge agent and multiple agents who perform specific tasks such as exploratory data analysis, feature engineering, model selection/hyperparameter tuning, model training, model evaluation, and writing documentation. The MRM crew consists of a judge agent along with specialized agents who perform tasks such as checking compliance of modeling documentation, model replication, conceptual soundness, analysis of outcomes, and writing documentation. We demonstrate the effectiveness and robustness of modeling and MRM crews by presenting a series of numerical examples applied to credit card fraud detection, credit card approval, and portfolio credit risk modeling datasets.

摘要

大型语言模型的出现开创了智能代理系统的新纪元,这些人工智能程序在多元领域展现出卓越的自主决策能力。本文探讨金融服务行业中智能代理系统的工作流程。我们特别构建了包含人类参与模块的代理团队,能够高效协作完成复杂建模与模型风险管理(MRM)任务。建模团队由评审代理与多个功能代理组成,分别执行探索性数据分析、特征工程、模型选择/超参数调优、模型训练、模型评估及文档编写等专项任务。MRM团队则包含评审代理与专业代理,负责执行建模文档合规性检查、模型复现、概念合理性验证、结果分析及文档撰写等工作。通过信用卡欺诈检测、信用卡审批及组合信用风险建模数据集的一系列数值实验,我们验证了建模团队与MRM团队的有效性和鲁棒性。


LLM-driven Effective Knowledge Tracing by Integrating Dual-channel Difficulty

Abstract

arXiv:2502.19915v2 Announce Type: replace Abstract: Knowledge Tracing (KT) is a fundamental technology in intelligent tutoring systems used to simulate changes in students' knowledge state during learning, track personalized knowledge mastery, and predict performance. However, current KT models face three major challenges: (1) When encountering new questions, models face cold-start problems due to sparse interaction records, making precise modeling difficult; (2) Traditional models only use historical interaction records for student personalization modeling, unable to accurately track individual mastery levels, resulting in unclear personalized modeling; (3) The decision-making process is opaque to educators, making it challenging for them to understand model judgments. To address these challenges, we propose a novel Dual-channel Difficulty-aware Knowledge Tracing (DDKT) framework that utilizes Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) for subjective difficulty assessment, while integrating difficulty bias-aware algorithms and student mastery algorithms for precise difficulty measurement. Our framework introduces three key innovations: (1) Difficulty Balance Perception Sequence (DBPS) - students' subjective perceptions combined with objective difficulty, measuring gaps between LLM-assessed difficulty, mathematical-statistical difficulty, and students' subjective perceived difficulty through attention mechanisms; (2) Difficulty Mastery Ratio (DMR) - precise modeling of student mastery levels through different difficulty zones; (3) Knowledge State Update Mechanism - implementing personalized knowledge acquisition through gated networks and updating student knowledge state. Experimental results on two real datasets show our method consistently outperforms nine baseline models, improving AUC metrics by 2% to 10% while effectively addressing cold-start problems and enhancing model interpretability.

摘要

知识追踪(KT)是智能辅导系统中的核心技术,用于模拟学生学习过程中知识状态的变化、追踪个性化知识掌握程度并预测学习表现。然而当前KT模型面临三大挑战:(1)面对新题目时,模型因交互记录稀疏而遭遇冷启动问题,难以精确建模;(2)传统模型仅利用历史交互记录进行学生个性化建模,无法准确追踪个体掌握水平,导致个性化建模不清晰;(3)决策过程对教育者不透明,使其难以理解模型判断。针对这些挑战,我们提出新型双通道难度感知知识追踪框架(DDKT),利用大语言模型(LLM)和检索增强生成(RAG)进行主观难度评估,同时整合难度偏差感知算法与学生掌握度算法实现精准难度度量。该框架包含三项关键创新:(1)难度平衡感知序列(DBPS)——结合学生主观感知与客观难度,通过注意力机制衡量LLM评估难度、数理统计难度与学生主观感知难度间的差距;(2)难度掌握比率(DMR)——通过不同难度区间精准建模学生掌握水平;(3)知识状态更新机制——通过门控网络实现个性化知识获取并更新学生知识状态。在两个真实数据集上的实验表明,我们的方法始终优于九个基线模型,AUC指标提升2%至10%,同时有效解决冷启动问题并增强模型可解释性。


EmoAgent: Assessing and Safeguarding Human-AI Interaction for Mental Health Safety

Abstract

arXiv:2504.09689v3 Announce Type: replace Abstract: The rise of LLM-driven AI characters raises safety concerns, particularly for vulnerable human users with psychological disorders. To address these risks, we propose EmoAgent, a multi-agent AI framework designed to evaluate and mitigate mental health hazards in human-AI interactions. EmoAgent comprises two components: EmoEval simulates virtual users, including those portraying mentally vulnerable individuals, to assess mental health changes before and after interactions with AI characters. It uses clinically proven psychological and psychiatric assessment tools (PHQ-9, PDI, PANSS) to evaluate mental risks induced by LLM. EmoGuard serves as an intermediary, monitoring users' mental status, predicting potential harm, and providing corrective feedback to mitigate risks. Experiments conducted in popular character-based chatbots show that emotionally engaging dialogues can lead to psychological deterioration in vulnerable users, with mental state deterioration in more than 34.4% of the simulations. EmoGuard significantly reduces these deterioration rates, underscoring its role in ensuring safer AI-human interactions. Our code is available at: https://github.com/1akaman/EmoAgent

摘要

LLM驱动的AI角色兴起引发了安全隐患,尤其对存在心理障碍的脆弱人类用户构成风险。为解决这些问题,我们提出EmoAgent——一个用于评估和缓解人机交互中心理健康风险的多智能体AI框架。该框架包含两个组件:EmoEval通过模拟虚拟用户(包括心理脆弱个体)来评估与AI角色交互前后的心理健康变化,采用临床验证的心理与精神病学评估工具(PHQ-9、PDI、PANSS)量化LLM诱发的心理风险;EmoGuard作为中介系统,实时监测用户心理状态、预测潜在危害并提供矫正反馈以降低风险。在主流角色聊天机器人中的实验表明,情感互动对话可能导致34.4%以上的脆弱用户模拟场景出现心理状态恶化,而EmoGuard能显著降低恶化率,证实其在保障人机交互安全方面的作用。代码已开源:https://github.com/1akaman/EmoAgent


Round Trip Translation Defence against Large Language Model Jailbreaking Attacks

Abstract

arXiv:2402.13517v2 Announce Type: replace-cross Abstract: Large language models (LLMs) are susceptible to social-engineered attacks that are human-interpretable but require a high level of comprehension for LLMs to counteract. Existing defensive measures can only mitigate less than half of these attacks at most. To address this issue, we propose the Round Trip Translation (RTT) method, the first algorithm specifically designed to defend against social-engineered attacks on LLMs. RTT paraphrases the adversarial prompt and generalizes the idea conveyed, making it easier for LLMs to detect induced harmful behavior. This method is versatile, lightweight, and transferable to different LLMs. Our defense successfully mitigated over 70% of Prompt Automatic Iterative Refinement (PAIR) attacks, making it, to the best of our knowledge, the most effective defense currently available. We are also the first to attempt mitigating the MathsAttack, reducing its attack success rate by almost 40%. Our code is publicly available at https://github.com/Cancanxxx/Round_Trip_Translation_Defence . This version of the article has been accepted for publication, after peer review (when applicable) but is not the Version of Record and does not reflect post-acceptance improvements, or any corrections. The Version of Record is available online at: https://doi.org/10.48550/arXiv.2402.13517 . Use of this Accepted Version is subject to the publisher's Accepted Manuscript terms of use: https://www.springernature.com/gp/open-research/policies/accepted-manuscript-terms

摘要

大型语言模型(LLMs)易受社会工程攻击,这类攻击对人类可解释但需要LLMs具备较高理解能力才能抵御。现有防御措施最多仅能缓解不到一半的攻击。为解决该问题,我们提出循环翻译(RTT)方法——首个专门设计用于防御LLMs社会工程攻击的算法。RTT通过复述对抗性提示并泛化其传达的思想,使LLMs更容易识别诱导性有害行为。该方法具有通用性、轻量级特点,可迁移至不同LLMs。我们的防御成功阻断了超过70%的提示自动迭代优化(PAIR)攻击,据我们所知这是当前最有效的防御方案。我们也是首个尝试缓解数学攻击(MathsAttack)的团队,将其攻击成功率降低了近40%。代码已开源:https://github.com/Cancanxxx/Round_Trip_Translation_Defence。本文版本已通过同行评审并被录用,但非最终出版版本,不包含录用后的修改或更正。最终版本详见:https://doi.org/10.48550/arXiv.2402.13517。
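
The core mechanism, translating a prompt into a pivot language and back so that the returned text is a paraphrase of the original, can be sketched as a small wrapper around two translation callables. Everything below is an illustrative assumption: the word-level dictionaries are toy stand-ins for a real machine translation system, and none of the names come from the authors' repository.

```python
from typing import Callable, Dict

Translator = Callable[[str], str]


def round_trip_translate(prompt: str,
                         forward: Translator,
                         backward: Translator) -> str:
    """Translate into a pivot language and back, returning the paraphrase."""
    return backward(forward(prompt))


# Toy stand-ins for a translation system (hypothetical word tables).
# Two English synonyms map to one pivot word, so the round trip
# collapses them onto a single, more common wording.
TOY_FORWARD: Dict[str, str] = {
    "demolish": "zerstoeren", "destroy": "zerstoeren",
    "the": "die", "wall": "mauer",
}
TOY_BACKWARD: Dict[str, str] = {
    "zerstoeren": "destroy", "die": "the", "mauer": "wall",
}


def toy_forward(text: str) -> str:
    return " ".join(TOY_FORWARD.get(w, w) for w in text.lower().split())


def toy_backward(text: str) -> str:
    return " ".join(TOY_BACKWARD.get(w, w) for w in text.split())


paraphrase = round_trip_translate("Demolish the wall", toy_forward, toy_backward)
print(paraphrase)
```

In a defence pipeline the paraphrased prompt, rather than the original, would be passed to the protected model; how many pivot languages to chain is a design choice this sketch leaves open.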


Can We Trust Embodied Agents? Exploring Backdoor Attacks against Embodied LLM-based Decision-Making Systems

Abstract

arXiv:2405.20774v3 Announce Type: replace-cross Abstract: Large Language Models (LLMs) have shown significant promise in real-world decision-making tasks for embodied artificial intelligence, especially when fine-tuned to leverage their inherent common sense and reasoning abilities while being tailored to specific applications. However, this fine-tuning process introduces considerable safety and security vulnerabilities, especially in safety-critical cyber-physical systems. In this work, we propose the first comprehensive framework for Backdoor Attacks against LLM-based Decision-making systems (BALD) in embodied AI, systematically exploring the attack surfaces and trigger mechanisms. Specifically, we propose three distinct attack mechanisms: word injection, scenario manipulation, and knowledge injection, targeting various components in the LLM-based decision-making pipeline. We perform extensive experiments on representative LLMs (GPT-3.5, LLaMA2, PaLM2) in autonomous driving and home robot tasks, demonstrating the effectiveness and stealthiness of our backdoor triggers across various attack channels, with cases like vehicles accelerating toward obstacles and robots placing knives on beds. Our word and knowledge injection attacks achieve nearly 100% success rate across multiple models and datasets while requiring only limited access to the system. Our scenario manipulation attack yields success rates exceeding 65%, reaching up to 90%, and does not require any runtime system intrusion. We also assess the robustness of these attacks against defenses, revealing their resilience. Our findings highlight critical security vulnerabilities in embodied LLM systems and emphasize the urgent need for safeguarding these systems to mitigate potential risks.

摘要

大型语言模型(LLMs)在具身人工智能的实际决策任务中展现出显著潜力,特别是在经过微调以利用其固有常识与推理能力并适配特定应用场景时。然而,这种微调过程会引入严重的安全隐患,尤其在安全关键的信息物理系统中。本研究首次提出针对具身AI中基于LLM决策系统的后门攻击框架(BALD),系统性地探索攻击面与触发机制。具体而言,我们提出三种攻击机制:词汇注入、场景操控和知识注入,分别针对基于LLM决策流程中的不同环节。我们在自动驾驶和家庭机器人任务中对代表性LLM(GPT-3.5、LLaMA2、PaLM2)开展大量实验,证明后门触发器在多种攻击渠道下的有效性与隐蔽性,包括车辆加速冲向障碍物、机器人将刀具放置于床铺等案例。我们的词汇与知识注入攻击在多个模型和数据集上实现近100%成功率,且仅需有限系统访问权限;场景操控攻击成功率超过65%,最高达90%,且无需运行时系统侵入。我们还评估了这些攻击对防御措施的鲁棒性,揭示其强韧性。研究结果凸显了具身LLM系统的关键安全漏洞,强调亟需建立防护机制以降低潜在风险。


Demystifying AI Platform Design for Distributed Inference of Next-Generation LLM models

Abstract

arXiv:2406.01698v2 Announce Type: replace-cross Abstract: Large language models (LLMs) have shown remarkable performance across a wide range of applications, often outperforming human experts. However, deploying these gigantic models efficiently for diverse inference use cases requires carefully designed hardware platforms with ample computing, memory, and network resources. With constant innovation in LLM serving optimizations and model architecture evolving at breakneck speed, the hardware requirements to meet Service Level Objectives (SLOs) remain an open research question. To answer the question, we present an analytical tool, GenZ, to efficiently navigate the relationship between diverse LLM model architectures (Dense, GQA, MoE, Mamba), LLM serving optimizations (Chunking, Speculative decoding, quantization), and AI platform design parameters. Our tool estimates LLM inference performance metrics for the given scenario. We have validated against real hardware platforms running various LLM models, achieving a max geomean error of 5.82. We use GenZ to identify compute, memory capacity, memory bandwidth, network latency, and network bandwidth requirements across diverse LLM inference use cases. We also study diverse architectural choices in use today (inspired by LLM serving platforms from several vendors) to help inform computer architects designing next-generation AI hardware accelerators and platforms. The trends and insights derived from GenZ can guide AI engineers deploying LLMs as well as computer architects designing next-generation hardware accelerators and platforms. Ultimately, this work sheds light on the platform design considerations for unlocking the full potential of large language models across a spectrum of applications. The source code is available at https://github.com/abhibambhaniya/GenZ-LLM-Analyzer . Users can also try it at https://genz-llm-analyzer.streamlit.app/ in a web browser without any setup.

摘要

大型语言模型(LLMs)在广泛的应用中展现出卓越性能,往往超越人类专家水平。然而,为多样化推理用例高效部署这些庞大模型,需要精心设计具备充足计算、内存和网络资源的硬件平台。随着LLM服务优化技术的持续创新和模型架构的飞速演进,满足服务水平目标(SLOs)所需的硬件要求仍是待解的研究问题。

为此,我们提出分析工具GenZ,用于系统探索多样化LLM架构(稠密模型、分组查询注意力、混合专家、Mamba)、LLM服务优化技术(分块处理、推测解码、量化)与AI平台设计参数之间的关联关系。该工具可估算给定场景下的LLM推理性能指标。我们在运行各类LLM模型的真实硬件平台上进行验证,最大几何平均误差为5.82%。通过GenZ,我们量化了不同LLM推理用例对计算力、内存容量、内存带宽、网络延迟及网络带宽的需求,并研究了当前主流架构选择(受多家厂商LLM服务平台启发),为设计下一代AI硬件加速器与平台的计算机架构师提供参考。

GenZ揭示的趋势与洞见既能指导AI工程师部署LLM,也可协助计算机架构师设计新一代硬件加速器与平台。本研究最终为充分释放大型语言模型跨应用潜力所需的平台设计考量提供了理论依据。


Lossless data compression by large models

Abstract

arXiv:2407.07723v3 Announce Type: replace-cross Abstract: Modern data compression methods are slowly reaching their limits after 80 years of research, millions of papers, and wide range of applications. Yet, the extravagant 6G communication speed requirement raises a major open question for revolutionary new ideas of data compression. We have previously shown all understanding or learning are compression, under reasonable assumptions. Large language models (LLMs) understand data better than ever before. Can they help us to compress data? The LLMs may be seen to approximate the uncomputable Solomonoff induction. Therefore, under this new uncomputable paradigm, we present LMCompress. LMCompress shatters all previous lossless compression algorithms, doubling the lossless compression ratios of JPEG-XL for images, FLAC for audios, and H.264 for videos, and quadrupling the compression ratio of bz2 for texts. The better a large model understands the data, the better LMCompress compresses.

摘要

经过80年的研究、数百万篇论文和广泛的应用,现代数据压缩方法正逐渐接近其极限。然而,6G通信对传输速度的苛刻要求,为数据压缩领域的革命性新思路提出了一个重大开放性问题。我们先前的研究表明,在合理假设下,所有理解或学习过程本质上都是数据压缩。大型语言模型(LLMs)对数据的理解能力达到了前所未有的水平。它们能否帮助我们压缩数据?LLMs可被视为对不可计算的所罗门诺夫归纳的近似。因此,在这一新的不可计算范式下,我们提出了LMCompress。该算法彻底超越了所有现有无损压缩方法:对图像的压缩率是JPEG-XL的两倍,对音频是FLAC的两倍,对视频是H.264的两倍,对文本的压缩率更是达到bz2的四倍。大型模型对数据的理解越深入,LMCompress的压缩效果就越出色。
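
The claim that better understanding yields better compression rests on a standard information-theoretic fact: an entropy coder driven by a predictive model spends about -log2 p bits on a symbol the model assigned probability p. The sketch below computes that ideal code length for two toy models; it is not LMCompress, and the unigram model is fitted on the very text it encodes purely for illustration (a real codec could not assume the decoder already knows those counts).

```python
import math
from collections import Counter
from typing import Callable, Iterable


def ideal_code_length_bits(symbols: Iterable[str],
                           prob: Callable[[str], float]) -> float:
    """Sum of -log2 p(symbol): the bits an ideal entropy coder would need."""
    return sum(-math.log2(prob(s)) for s in symbols)


text = "abababababababab"  # 16 characters, only two distinct symbols

# Model 1: knows nothing, treats each character as one of 256 equally likely bytes.
def uniform_byte_model(symbol: str) -> float:
    return 1.0 / 256.0


# Model 2: a unigram model fitted on the text (illustrative only, see lead-in).
counts = Counter(text)


def unigram_model(symbol: str) -> float:
    return counts[symbol] / len(text)


bits_uniform = ideal_code_length_bits(text, uniform_byte_model)
bits_unigram = ideal_code_length_bits(text, unigram_model)
print(f"uniform: {bits_uniform:.0f} bits, unigram: {bits_unigram:.0f} bits")
```

A model that also captured the strict alternation of the two characters would assign probabilities near 1 and drive the cost lower still, which is the sense in which a stronger predictor compresses better.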


Let Network Decide What to Learn: Symbolic Music Understanding Model Based on Large-scale Adversarial Pre-training

Abstract

arXiv:2407.08306v3 Announce Type: replace-cross Abstract: As a crucial aspect of Music Information Retrieval (MIR), Symbolic Music Understanding (SMU) has garnered significant attention for its potential to assist both musicians and enthusiasts in learning and creating music. Recently, pre-trained language models have been widely adopted in SMU due to the substantial similarities between symbolic music and natural language, as well as the ability of these models to leverage limited music data effectively. However, some studies have shown that common pre-training methods such as the Masked Language Model (MLM) may introduce bias issues, such as racial discrimination in Natural Language Processing (NLP), and degrade the performance of downstream tasks, which also happens in SMU. This bias often arises when masked tokens cannot be inferred from their context, forcing the model to overfit the training set instead of generalizing. To address this challenge, we propose Adversarial-MidiBERT for SMU, which adaptively determines what to mask during MLM via a masker network, rather than employing random masking. By avoiding the masking of tokens that are difficult to infer from context, our model is better equipped to capture contextual structures and relationships, rather than merely conforming to the training data distribution. We evaluate our method across four SMU tasks, and our approach demonstrates excellent performance in all cases. The code for our model is publicly available at https://github.com/RS2002/Adversarial-MidiBERT .

摘要

作为音乐信息检索(MIR)的关键领域,符号音乐理解(SMU)因其在辅助音乐从业者与爱好者学习创作方面的潜力而备受关注。鉴于符号音乐与自然语言的高度相似性,以及预训练语言模型对有限音乐数据的高效利用能力,此类模型已在SMU领域得到广泛应用。然而研究表明,类似掩码语言模型(MLM)的常规预训练方法可能引发自然语言处理中存在的种族歧视等偏见问题,并影响下游任务性能,这种现象在SMU中同样存在。当被掩码标记无法通过上下文推断时,模型会过度拟合训练数据而非实现泛化,进而导致偏差产生。针对这一挑战,我们提出面向SMU的对抗式MidiBERT模型,其通过掩码器网络自适应地确定MLM过程中的掩码对象,而非采用随机掩码策略。该方法通过避免掩码难以推断的上下文标记,使模型更专注于捕捉语境结构与关联关系,而非简单拟合训练数据分布。我们在四项SMU任务上评估本方法,实验结果表明该模型在所有案例中均表现优异。模型代码已公开于https://github.com/RS2002/Adversarial-MidiBERT。


Patched RTC: evaluating LLMs for diverse software development tasks

Abstract

arXiv:2407.16557v3 Announce Type: replace-cross Abstract: This paper introduces Patched Round-Trip Correctness (Patched RTC), a novel evaluation technique for Large Language Models (LLMs) applied to diverse software development tasks, particularly focusing on "outer loop" activities such as bug fixing, code review, and documentation updates. Patched RTC extends the original Round-Trip Correctness method to work with any LLM and downstream task, offering a self-evaluating framework that measures consistency and robustness of model responses without human intervention. The study demonstrates a correlation between Patched RTC scores and task-specific accuracy metrics, presenting it as an alternative to the LLM-as-Judge paradigm for open-domain task evaluation. We implement Patched RTC in an open-source framework called patchwork, allowing for transparent evaluation during inference across various patchflows. Experiments comparing GPT-3.5 and GPT-4 models across different software development tasks reveal that Patched RTC effectively distinguishes model performance and task difficulty. The paper also explores the impact of consistency prompts on improving model accuracy, suggesting that Patched RTC can guide prompt refinement and model selection for complex software development workflows.

摘要

本文介绍了"修补版往返正确性"(Patched RTC)这一新型大语言模型评估技术,该技术适用于多种软件开发任务,尤其专注于错误修复、代码审查和文档更新等"外循环"活动。Patched RTC将原始往返正确性方法扩展至适用于任何大语言模型和下游任务,提供了一个无需人工干预即可衡量模型响应一致性与鲁棒性的自评估框架。研究表明Patched RTC评分与任务特定准确度指标存在相关性,可作为开放领域任务评估中"LLM-as-Judge"范式的替代方案。我们在名为patchwork的开源框架中实现了Patched RTC,支持跨多种修补流程在推理过程中进行透明评估。通过比较GPT-3.5和GPT-4模型在不同软件开发任务中的表现,实验表明Patched RTC能有效区分模型性能和任务难度。本文还探讨了一致性提示对提升模型准确率的影响,表明Patched RTC可为复杂软件开发工作流中的提示优化和模型选择提供指导。


Patched MOA: optimizing inference for diverse software development tasks

Abstract

arXiv:2407.18521v4 Announce Type: replace-cross Abstract: This paper introduces Patched MOA (Mixture of Agents), an inference optimization technique that significantly enhances the performance of large language models (LLMs) across diverse software development tasks. We evaluate three inference optimization algorithms - Best of N, Mixture of Agents, and Monte Carlo Tree Search - and demonstrate that Patched MOA can boost the performance of smaller models to surpass that of larger, more expensive models. Notably, our approach improves the gpt-4o-mini model's performance on the Arena-Hard-Auto benchmark by 15.52%, outperforming gpt-4-turbo at a fraction of the cost. We also apply Patched MOA to various software development workflows, showing consistent improvements in task completion rates. Our method is model-agnostic, transparent to end-users, and can be easily integrated into existing LLM pipelines. This work contributes to the growing field of LLM optimization, offering a cost-effective solution for enhancing model performance without the need for fine-tuning or larger models. Our implementation is open-source and available at https://github.com/codelion/optillm.

摘要

本文介绍了Patched MOA(混合代理)这一推理优化技术,该技术能显著提升大语言模型(LLM)在各类软件开发任务中的性能表现。我们评估了三种推理优化算法——N选最优法、代理混合法和蒙特卡洛树搜索法,并证明Patched MOA可使较小模型的性能超越更大、更昂贵的模型。值得注意的是,该方法使gpt-4o-mini模型在Arena-Hard-Auto基准测试中的性能提升了15.52%,以极低成本超越了gpt-4-turbo的表现。我们还将Patched MOA应用于多种软件开发工作流,在任务完成率方面均展现出持续改进。该方法具有模型无关性、对终端用户透明等特点,可轻松集成至现有LLM流程中。本研究为LLM优化领域提供了无需微调或更大模型的经济高效性能提升方案,相关实现已开源(https://github.com/codelion/optillm)。


Revise, Reason, and Recognize: LLM-Based Emotion Recognition via Emotion-Specific Prompts and ASR Error Correction

Abstract

arXiv:2409.15551v2 Announce Type: replace-cross Abstract: Annotating and recognizing speech emotion using prompt engineering has recently emerged with the advancement of Large Language Models (LLMs), yet its efficacy and reliability remain questionable. In this paper, we conduct a systematic study on this topic, beginning with the proposal of novel prompts that incorporate emotion-specific knowledge from acoustics, linguistics, and psychology. Subsequently, we examine the effectiveness of LLM-based prompting on Automatic Speech Recognition (ASR) transcription, contrasting it with ground-truth transcription. Furthermore, we propose a Revise-Reason-Recognize prompting pipeline for robust LLM-based emotion recognition from spoken language with ASR errors. Additionally, experiments on context-aware learning, in-context learning, and instruction tuning are performed to examine the usefulness of LLM training schemes in this direction. Finally, we investigate the sensitivity of LLMs to minor prompt variations. Experimental results demonstrate the efficacy of the emotion-specific prompts, ASR error correction, and LLM training schemes for LLM-based emotion recognition. Our study aims to refine the use of LLMs in emotion recognition and related domains.

摘要

基于提示工程的语音情感标注与识别技术随着大语言模型(LLMs)的发展而兴起,但其有效性和可靠性仍存疑。本文对此展开系统研究:首先提出融合声学、语言学和心理学领域情感知识的新型提示模板;其次考察基于LLM的提示方法在自动语音识别(ASR)转文本上的有效性,并与真实转文本进行对比;进而提出"修正-推理-识别"的三阶段提示流程,用于从含ASR错误的语音中实现鲁棒的LLM情感识别。此外,通过上下文感知学习、上下文内学习和指令调优实验,探究LLM训练方案在该领域的适用性。最后研究了LLM对提示细微变化的敏感性。实验结果表明,情感专用提示模板、ASR纠错机制和LLM训练方案能有效提升基于LLM的情感识别性能。本研究旨在优化LLM在情感识别及相关领域的应用。


Semi-Supervised Cognitive State Classification from Speech with Multi-View Pseudo-Labeling

Abstract

arXiv:2409.16937v3 Announce Type: replace-cross Abstract: The lack of labeled data is a common challenge in speech classification tasks, particularly those requiring extensive subjective assessment, such as cognitive state classification. In this work, we propose a Semi-Supervised Learning (SSL) framework, introducing a novel multi-view pseudo-labeling method that leverages both acoustic and linguistic characteristics to select the most confident data for training the classification model. Acoustically, unlabeled data are compared to labeled data using the Frechet audio distance, calculated from embeddings generated by multiple audio encoders. Linguistically, large language models are prompted to revise automatic speech recognition transcriptions and predict labels based on our proposed task-specific knowledge. High-confidence data are identified when pseudo-labels from both sources align, while mismatches are treated as low-confidence data. A bimodal classifier is then trained to iteratively label the low-confidence data until a predefined criterion is met. We evaluate our SSL framework on emotion recognition and dementia detection tasks. Experimental results demonstrate that our method achieves competitive performance compared to fully supervised learning using only 30% of the labeled data and significantly outperforms two selected baselines.

摘要

标记数据不足是语音分类任务中常见的挑战,尤其在需要大量主观评估的任务(如认知状态分类)中更为突出。本研究提出一种半监督学习框架,通过引入新颖的多视角伪标签生成方法,利用声学与语言学特征联合筛选高置信度数据用于分类模型训练。在声学层面,我们通过多种音频编码器生成嵌入向量,基于弗雷歇音频距离度量未标记数据与已标记数据的相似性。在语言层面,采用大语言模型对自动语音识别文本进行修正,并基于我们提出的任务相关知识进行标签预测。当两种来源的伪标签一致时判定为高置信度数据,存在分歧时则视为低置信度数据。随后训练双模态分类器对低置信度数据进行迭代标注,直至满足预设条件。我们在情感识别和痴呆症检测任务上评估该半监督学习框架,实验结果表明:仅使用30%标记数据时,本方法性能即可媲美全监督学习,并显著优于两个选定基线模型。
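
The agreement rule in the abstract (a sample is high-confidence only when the acoustic and linguistic pseudo-labels match) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `split_by_agreement` and the toy labels are hypothetical, and the two label lists are assumed to have already been produced by the acoustic view (Frechet audio distance) and the linguistic view (LLM prediction).

```python
from typing import Hashable, Sequence


def split_by_agreement(
    acoustic_labels: Sequence[Hashable],
    linguistic_labels: Sequence[Hashable],
) -> tuple[list[tuple[int, Hashable]], list[int]]:
    """Split samples into high- and low-confidence sets.

    high_confidence: (index, label) pairs where both views agree.
    low_confidence: indices where the two views disagree.
    """
    if len(acoustic_labels) != len(linguistic_labels):
        raise ValueError("both views must label the same samples")
    high_confidence, low_confidence = [], []
    for i, (a, l) in enumerate(zip(acoustic_labels, linguistic_labels)):
        if a == l:
            high_confidence.append((i, a))
        else:
            low_confidence.append(i)
    return high_confidence, low_confidence


# Toy pseudo-labels (hypothetical values for illustration only).
acoustic = ["happy", "sad", "angry", "neutral"]
linguistic = ["happy", "neutral", "angry", "neutral"]
high, low = split_by_agreement(acoustic, linguistic)
print(high)  # samples 0, 2 and 3 agree
print(low)   # sample 1 disagrees
```

In the framework described above, the low-confidence indices would then be relabeled iteratively by the bimodal classifier; that loop is omitted here.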


MicroScopiQ: Accelerating Foundational Models through Outlier-Aware Microscaling Quantization

Abstract

arXiv:2411.05282v4 Announce Type: replace-cross Abstract: Quantization of foundational models (FMs) is significantly more challenging than that of traditional DNNs due to the emergence of large-magnitude values called outliers. Existing outlier-aware algorithm-architecture co-design techniques either use mixed precision, retaining outliers at high precision but compromising hardware efficiency, or quantize inliers and outliers at the same precision, improving hardware efficiency at the cost of accuracy. To address this mutual exclusivity, we propose MicroScopiQ, a novel co-design technique that leverages pruning to complement outlier-aware quantization. MicroScopiQ retains outliers at higher precision while pruning a certain fraction of least important weights to distribute the additional outlier bits, ensuring high accuracy, aligned memory, and hardware efficiency. We design a high-throughput, low-overhead accelerator architecture composed of multi-precision INT processing elements and a network-on-chip called ReCoN that efficiently abstracts the complexity of supporting high-precision outliers. Additionally, unlike prior techniques, MicroScopiQ does not assume any locality of outlier weights, enabling applicability to a broad range of FMs. Extensive experiments across diverse quantization settings demonstrate that MicroScopiQ achieves state-of-the-art quantization accuracy, while delivering up to 3x faster inference and 2x lower energy consumption compared to existing alternatives. Code is available at: https://github.com/georgia-tech-synergy-lab/MicroScopiQ-LLM-Quantization

摘要

基础模型(FMs)的量化比传统深度神经网络(DNNs)更具挑战性,主要由于存在称为异常值的大幅值数据。现有的异常值感知算法-架构协同设计技术要么采用混合精度(以硬件效率为代价保留高精度异常值),要么以相同精度量化正常值与异常值(通过牺牲精度提升硬件效率)。为解决这种互斥性,我们提出MicroScopiQ——一种利用剪枝辅助异常值感知量化的新型协同设计技术。该方法通过保留高精度异常值并剪除部分最不重要权重以分配额外异常值比特,在确保高精度的同时实现内存对齐与硬件效率。我们设计了一种由多精度INT处理单元和片上网络ReCoN组成的高吞吐量、低开销加速器架构,其能高效抽象支持高精度异常值的复杂性。此外,与现有技术不同,MicroScopiQ不假设异常值权重的任何局部性,可广泛应用于各类基础模型。多样化量化设置下的实验表明,MicroScopiQ在实现最先进量化精度的同时,相比现有方案推理速度提升达3倍,能耗降低达2倍。代码已开源:https://github.com/georgia-tech-synergy-lab/MicroScopiQ-LLM-Quantization
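
The bit-budget idea in the abstract (outliers keep more bits, and pruned weights give up theirs) can be illustrated with a toy example. Everything below is an assumption made for illustration and is not MicroScopiQ's actual algorithm: outliers are flagged by a simple multiple of the mean magnitude, weight magnitude stands in for "importance", and one inlier is pruned per outlier so that an 8-bit outlier reuses the 4 bits freed by a pruned 4-bit inlier.

```python
def quantize(value: float, bits: int, max_abs: float) -> float:
    """Symmetric uniform quantization onto a signed grid with `bits` bits."""
    levels = 2 ** (bits - 1) - 1
    if max_abs == 0:
        return 0.0
    step = max_abs / levels
    q = max(-levels, min(levels, round(value / step)))
    return q * step


def quantize_group(weights, inlier_bits=4, outlier_bits=8, outlier_factor=3.0):
    """Toy outlier-aware quantization of one weight group.

    Returns (dequantized_weights, bits_used). Assumes
    outlier_bits == 2 * inlier_bits so one pruned inlier pays for one outlier.
    """
    n = len(weights)
    mean_abs = sum(abs(w) for w in weights) / n
    # Hypothetical outlier rule: magnitude far above the group mean.
    outliers = {i for i, w in enumerate(weights) if abs(w) > outlier_factor * mean_abs}
    inliers = sorted((i for i in range(n) if i not in outliers),
                     key=lambda i: abs(weights[i]))
    # Prune the smallest-magnitude inliers, one per outlier.
    pruned = set(inliers[:len(outliers)])
    kept = [i for i in inliers if i not in pruned]
    inlier_max = max((abs(weights[i]) for i in kept), default=0.0)
    outlier_max = max((abs(weights[i]) for i in outliers), default=0.0)
    out = []
    for i, w in enumerate(weights):
        if i in pruned:
            out.append(0.0)
        elif i in outliers:
            out.append(quantize(w, outlier_bits, outlier_max))
        else:
            out.append(quantize(w, inlier_bits, inlier_max))
    bits_used = len(kept) * inlier_bits + len(outliers) * outlier_bits
    return out, bits_used


group = [0.1, -0.2, 0.05, 8.0, 0.3, -0.15, 0.25, 0.01]
dequantized, bits_used = quantize_group(group)
print(dequantized)
print(bits_used, "bits, same as", len(group) * 4, "for plain 4-bit storage")
```

In this toy group the single outlier (8.0) is stored at 8 bits, the smallest weight (0.01) is pruned, and the total stays at the 4-bit-per-weight budget, which is the alignment property the abstract emphasizes.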


FILA: Fine-Grained Vision Language Models

Abstract

arXiv:2412.08378v3 Announce Type: replace-cross Abstract: Recently, there has been growing interest in the capability of multimodal large language models (MLLMs) to process high-resolution images. A common approach currently involves dynamically cropping the original high-resolution image into smaller sub-images, which are then fed into a vision encoder that was pre-trained on lower-resolution images. However, this cropping approach often truncates objects and connected areas in the original image, causing semantic breaks. To address this limitation, we introduce HyViLM, designed to process images of any resolution while retaining the overall context during encoding. Specifically, we: (i) Design a new visual encoder called Hybrid Encoder that not only encodes individual sub-images but also interacts with detailed global visual features, significantly improving the model's ability to encode high-resolution images. (ii) Propose an optimal feature fusion strategy for the dynamic cropping approach, effectively leveraging information from different layers of the vision encoder. Compared with the state-of-the-art MLLMs under the same setting, our HyViLM outperforms existing MLLMs in nine out of ten tasks. Specifically, HyViLM achieves a 9.6% improvement in performance on the TextVQA task and a 6.9% enhancement on the DocVQA task.

摘要

近年来,多模态大语言模型(MLLMs)处理高分辨率图像的能力日益受到关注。当前主流方法通常将原始高分辨率图像动态裁剪为若干子图像,再输入至基于低分辨率图像预训练的视觉编码器。然而这种裁剪方式往往会导致原始图像中物体及关联区域的语义断裂。为突破这一局限,我们提出HyViLM模型,其能够在编码过程中保持整体语义上下文的同时处理任意分辨率图像。具体而言:(i)设计新型混合视觉编码器(Hybrid Encoder),不仅对子图像进行编码,还能与全局视觉特征进行细粒度交互,显著提升模型对高分辨率图像的编码能力;(ii)提出动态裁剪方法的最优特征融合策略,有效利用视觉编码器不同层级的信息。在相同实验设置下,相较于现有最优MLLMs,HyViLM在十项任务中有九项表现更优。具体而言,该模型在TextVQA任务上性能提升9.6%,在DocVQA任务上提升6.9%。


You Name It, I Run It: An LLM Agent to Execute Tests of Arbitrary Projects

Abstract

arXiv:2412.10133v2 Announce Type: replace-cross Abstract: The ability to execute the test suite of a project is essential in many scenarios, e.g., to assess code quality and code coverage, to validate code changes made by developers or automated tools, and to ensure compatibility with dependencies. Despite its importance, executing the test suite of a project can be challenging in practice because different projects use different programming languages, software ecosystems, build systems, testing frameworks, and other tools. These challenges make it difficult to create a reliable, universal test execution method that works across different projects. This paper presents ExecutionAgent, an automated technique that prepares scripts for building an arbitrary project from source code and running its test cases. Inspired by the way a human developer would address this task, our approach is a large language model (LLM)-based agent that autonomously executes commands and interacts with the host system. The agent uses meta-prompting to gather guidelines on the latest technologies related to the given project, and it iteratively refines its process based on feedback from the previous steps. Our evaluation applies ExecutionAgent to 50 open-source projects that use 14 different programming languages and many different build and testing tools. The approach successfully executes the test suites of 33/50 projects, while matching the test results of ground truth test suite executions with a deviation of only 7.5%. These results improve over the best previously available technique by 6.6x. The costs imposed by the approach are reasonable, with an execution time of 74 minutes and LLM costs of USD 0.16, on average per project. We envision ExecutionAgent to serve as a valuable tool for developers, automated programming tools, and researchers that need to execute tests across a wide variety of projects.

摘要

执行项目测试套件的能力在许多场景中至关重要,例如评估代码质量和覆盖率、验证开发者或自动化工具所做的代码变更,以及确保与依赖项的兼容性。尽管其重要性不言而喻,但在实践中执行项目测试套件可能面临诸多挑战,因为不同项目使用不同的编程语言、软件生态系统、构建系统、测试框架及其他工具。这些挑战使得创建一种可靠、通用的跨项目测试执行方法变得困难。本文提出ExecutionAgent,这是一种自动化技术,能够为从源代码构建任意项目并运行其测试用例准备脚本。受人类开发者处理此类任务方式的启发,我们的方法基于大型语言模型(LLM)构建了一个能够自主执行命令并与主机系统交互的智能体。该智能体通过元提示(meta-prompting)获取与给定项目相关的最新技术指南,并根据前序步骤的反馈迭代优化其执行流程。我们在评估中将ExecutionAgent应用于50个开源项目,这些项目涉及14种编程语言及多种构建与测试工具。该方法成功执行了33/50项目的测试套件,且与基准测试套件执行结果的偏差仅为7.5%。这些结果较此前最优技术提升了6.6倍。该方法的成本处于合理范围,平均每个项目的执行时间为74分钟,LLM成本为0.16美元。我们期望ExecutionAgent能成为开发者、自动化编程工具及研究人员在处理多样化项目测试执行时的有力工具。


SAGE: A Framework of Precise Retrieval for RAG

Abstract

arXiv:2503.01713v2 Announce Type: replace-cross Abstract: Retrieval-augmented generation (RAG) has demonstrated significant proficiency in conducting question-answering (QA) tasks within a specified corpus. Nonetheless, numerous failure instances of RAG in QA still exist. These failures are not solely attributable to the limitations of Large Language Models (LLMs); instead, they predominantly arise from the retrieval of inaccurate information for LLMs due to two limitations: (1) Current RAG methods segment the corpus without considering semantics, which weakens the correlation between questions and segments and makes relevant context hard to find. (2) There is a trade-off between missing essential context when less context is retrieved and getting irrelevant context when more is retrieved. In this paper, we introduce a RAG framework (SAGE) to overcome these limitations. First, to address segmentation that ignores semantics, we propose to train a semantic segmentation model. This model is trained to segment the corpus into semantically complete chunks. Second, to ensure that only the most relevant chunks are retrieved while the irrelevant ones are ignored, we design a chunk selection algorithm to dynamically select chunks based on the decreasing speed of the relevance score, leading to a more relevant selection. Third, to further ensure the precision of the retrieved chunks, we propose letting LLMs assess whether retrieved chunks are excessive or lacking and then adjust the amount of context accordingly. Experiments show that SAGE outperforms baselines by 61.25% in the quality of QA on average. Moreover, by avoiding retrieving noisy context, SAGE lowers the cost of the tokens consumed in LLM inference and achieves a 49.41% enhancement in cost efficiency on average. Additionally, our work offers valuable insights for boosting RAG.

摘要

检索增强生成(RAG)技术在特定语料库的问答(QA)任务中展现出显著优势。然而,RAG在QA场景中仍存在大量失败案例。这些失败不仅源于大语言模型(LLM)的局限性,更主要归因于检索过程中因两大缺陷导致LLM获取不准确信息:(1)现有RAG方法进行语料分割时未考虑语义,导致问题与文本片段间关联性受损,难以定位相关上下文;(2)检索上下文数量存在固有矛盾——较少上下文易遗漏关键信息,较多上下文则引入无关内容。

本文提出SAGE框架以突破这些限制。首先,针对语义无关的分割问题,我们训练语义分割模型将语料切分为语义完整的文本块。其次,设计动态块选择算法,基于相关性分数下降速度筛选最相关文本块并剔除无关内容。第三,为进一步确保检索精度,通过LLM评估检索文本块是否过量或不足,据此调整上下文数量。实验表明,SAGE在QA质量上平均超越基线方法61.25%。通过避免噪声上下文检索,SAGE降低LLM推理的token消耗成本,平均提升49.41%的成本效益。本研究为增强RAG系统提供了重要启示。
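
The second component, selecting chunks by how fast the relevance score decreases, can be sketched as follows. This is one plausible reading under stated assumptions, not SAGE's published algorithm: the function name `select_chunks`, the relative-drop criterion, and the `max_drop` and `min_k` parameters are all hypothetical.

```python
def select_chunks(scored_chunks, max_drop=0.3, min_k=1):
    """Keep top-ranked chunks until the relevance score drops sharply.

    scored_chunks: list of (chunk_id, relevance_score) pairs.
    max_drop: relative drop between consecutive scores that stops selection.
    min_k: minimum number of chunks to return when available.
    """
    ranked = sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)
    if not ranked:
        return []
    selected = [ranked[0][0]]
    for (_, prev_score), (chunk_id, score) in zip(ranked, ranked[1:]):
        # Relative decrease from the previous chunk's score.
        drop = (prev_score - score) / prev_score if prev_score > 0 else 1.0
        if drop > max_drop and len(selected) >= min_k:
            break
        selected.append(chunk_id)
    return selected


# Toy relevance scores (hypothetical): a clear gap separates c from d.
scores = [("a", 0.92), ("b", 0.88), ("c", 0.85), ("d", 0.40), ("e", 0.38)]
print(select_chunks(scores))
```

Unlike a fixed top-k cutoff, the number of chunks returned here adapts to the score distribution, which is the trade-off the abstract describes between missing context and retrieving noise.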


Abstract

arXiv:2503.14258v3 Announce Type: replace-cross Abstract: This paper introduces JuDGE (Judgment Document Generation Evaluation), a novel benchmark for evaluating the performance of judgment document generation in the Chinese legal system. We define the task as generating a complete legal judgment document from the given factual description of the case. To facilitate this benchmark, we construct a comprehensive dataset consisting of factual descriptions from real legal cases, paired with their corresponding full judgment documents, which serve as the ground truth for evaluating the quality of generated documents. This dataset is further augmented by two external legal corpora that provide additional legal knowledge for the task: one comprising statutes and regulations, and the other consisting of a large collection of past judgment documents. In collaboration with legal professionals, we establish a comprehensive automated evaluation framework to assess the quality of generated judgment documents across various dimensions. We evaluate various baseline approaches, including few-shot in-context learning, fine-tuning, and a multi-source retrieval-augmented generation (RAG) approach, using both general and legal-domain LLMs. The experimental results demonstrate that, while RAG approaches can effectively improve performance in this task, there is still substantial room for further improvement. All the codes and datasets are available at: https://github.com/oneal2000/JuDGE.

摘要

本文提出JuDGE(裁判文书生成评估基准),这是一个用于评估中国法律体系下裁判文书生成性能的新型基准。我们将该任务定义为:根据给定案件事实描述生成完整的裁判文书。为构建该基准,我们创建了一个综合性数据集,包含真实案件的事实描述及其对应的完整裁判文书(作为生成文书质量的评估标准),并通过两个外部法律语料库进行增强:一个包含法律法规条文,另一个由大量历史裁判文书构成。在与法律专业人士合作下,我们建立了全面的自动化评估框架,用于多维度评估生成裁判文书的质量。我们评估了多种基线方法,包括小样本上下文学习、微调以及多源检索增强生成(RAG)方法,测试对象涵盖通用领域和法律领域的大语言模型。实验结果表明,虽然RAG方法能有效提升任务表现,但仍存在显著改进空间。所有代码和数据集已开源:https://github.com/oneal2000/JuDGE。


Weight Ensembling Improves Reasoning in Language Models

Abstract

arXiv:2504.10478v3 Announce Type: replace-cross Abstract: We investigate a failure mode that arises during the training of reasoning models, where the diversity of generations begins to collapse, leading to suboptimal test-time scaling. Notably, the Pass@1 rate reliably improves during supervised finetuning (SFT), but Pass@k rapidly deteriorates. Surprisingly, a simple intervention of interpolating the weights of the latest SFT checkpoint with an early checkpoint, otherwise known as WiSE-FT, almost completely recovers Pass@k while also improving Pass@1. The WiSE-FT variant achieves better test-time scaling (Best@k, majority vote) and achieves superior results with less data when tuned further by reinforcement learning. Finally, we find that WiSE-FT provides complementary performance gains that cannot be achieved only through diversity-inducing decoding strategies, like temperature scaling. We formalize a bias-variance tradeoff of Pass@k with respect to the expectation and variance of Pass@1 over the test distribution. We find that WiSE-FT can reduce bias and variance simultaneously, while temperature scaling inherently trades off between bias and variance.

摘要

我们研究了一种在推理模型训练过程中出现的失效模式:生成多样性开始崩溃,导致测试时扩展效果欠佳。值得注意的是,在监督微调(SFT)阶段,Pass@1指标持续提升,但Pass@k指标却快速恶化。令人惊讶的是,仅需将最新SFT检查点与早期检查点进行权重插值(即WiSE-FT方法),就能几乎完全恢复Pass@k性能,同时还能提升Pass@1指标。采用WiSE-FT变体的模型实现了更优的测试时扩展性能(Best@k和多数投票),且当进一步通过强化学习调优时,能以更少数据获得更优结果。最后,我们发现WiSE-FT能带来无法仅通过温度调节等多样性诱导解码策略实现的互补性性能提升。我们形式化地建立了Pass@k关于测试分布上Pass@1期望与方差的偏差-方差权衡关系,发现WiSE-FT能同时降低偏差和方差,而温度调节则需要在偏差与方差之间进行固有权衡。
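
The intervention described above, interpolating the weights of a late SFT checkpoint with an early one, amounts to a per-parameter convex combination. The sketch below uses plain Python lists so it runs without dependencies; the function name `interpolate_weights`, the mixing coefficient `alpha`, and the toy checkpoints are assumptions for illustration, and the abstract does not state which coefficient the authors use.

```python
def interpolate_weights(early, late, alpha=0.5):
    """Return (1 - alpha) * early + alpha * late for every parameter.

    early, late: dicts mapping parameter names to lists of floats.
    alpha: weight on the late checkpoint (0 keeps early, 1 keeps late).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if early.keys() != late.keys():
        raise ValueError("checkpoints must have the same parameter names")
    merged = {}
    for name in early:
        if len(early[name]) != len(late[name]):
            raise ValueError(f"shape mismatch for {name}")
        merged[name] = [
            (1.0 - alpha) * e + alpha * l
            for e, l in zip(early[name], late[name])
        ]
    return merged


# Toy checkpoints (hypothetical values).
early_ckpt = {"layer.weight": [0.0, 1.0], "layer.bias": [0.5]}
late_ckpt = {"layer.weight": [1.0, 3.0], "layer.bias": [1.5]}
merged = interpolate_weights(early_ckpt, late_ckpt, alpha=0.5)
print(merged)
```

With a real model the same arithmetic would be applied to each tensor in the two checkpoints' parameter dictionaries before loading the result back into the model.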


FinSage: A Multi-aspect RAG System for Financial Filings Question Answering

Abstract

arXiv:2504.14493v2 Announce Type: replace-cross Abstract: Leveraging large language models in real-world settings often requires domain-specific data and tools in order to comply with the complex regulations that govern acceptable use. Within financial sectors, modern enterprises increasingly rely on Retrieval-Augmented Generation (RAG) systems to address complex compliance requirements in financial document workflows. However, existing solutions struggle to account for the inherent heterogeneity of data (e.g., text, tables, diagrams) and the evolving nature of regulatory standards used in financial filings, leading to compromised accuracy in critical information extraction. We propose the FinSage framework as a solution, utilizing a multi-aspect RAG framework tailored for regulatory compliance analysis in multi-modal financial documents. FinSage introduces three innovative components: (1) a multi-modal pre-processing pipeline that unifies diverse data formats and generates chunk-level metadata summaries, (2) a multi-path sparse-dense retrieval system augmented with query expansion (HyDE) and metadata-aware semantic search, and (3) a domain-specialized re-ranking module fine-tuned via Direct Preference Optimization (DPO) to prioritize compliance-critical content. Extensive experiments demonstrate that FinSage achieves an impressive recall of 92.51% on 75 expert-curated questions and surpasses the best baseline method on the FinanceBench question answering datasets by 24.06% in accuracy. Moreover, FinSage has been successfully deployed as a financial question-answering agent in online meetings, where it has already served more than 1,200 people.

摘要

在现实场景中运用大型语言模型时,通常需要利用领域特定数据和工具以满足复杂合规要求。金融领域内,现代企业日益依赖检索增强生成(RAG)系统来处理金融文档工作流中的复杂合规需求。然而,现有方案难以应对金融文件中数据固有异构性(如文本、表格、图表)及监管标准持续演变的特性,导致关键信息提取准确性受损。我们提出FinSage框架作为解决方案,该框架采用专为多模态金融文档合规分析定制的多维度RAG架构。FinSage包含三个创新组件:(1) 统一多源数据格式并生成分块级元数据摘要的多模态预处理流程,(2) 结合查询扩展(HyDE)与元数据感知语义搜索的多路径稀疏-稠密检索系统,(3) 通过直接偏好优化(DPO)微调的领域专业化重排序模块,优先识别合规关键内容。大量实验表明,FinSage在75个专家编制问题上实现了92.51%的召回率,在FinanceBench问答数据集上的准确率超越最佳基线方法24.06%。此外,FinSage已作为金融问答代理成功部署于在线会议场景,累计服务超1,200人。


Uncertainty Quantification for Language Models: A Suite of Black-Box, White-Box, LLM Judge, and Ensemble Scorers

Abstract

arXiv:2504.19254v2 Announce Type: replace-cross Abstract: Hallucinations are a persistent problem with Large Language Models (LLMs). As these models become increasingly used in high-stakes domains, such as healthcare and finance, the need for effective hallucination detection is crucial. To this end, we propose a versatile framework for zero-resource hallucination detection that practitioners can apply to real-world use cases. To achieve this, we adapt a variety of existing uncertainty quantification (UQ) techniques, including black-box UQ, white-box UQ, and LLM-as-a-Judge, transforming them as necessary into standardized response-level confidence scores ranging from 0 to 1. To enhance flexibility, we introduce a tunable ensemble approach that incorporates any combination of the individual confidence scores. This approach enables practitioners to optimize the ensemble for a specific use case for improved performance. To streamline implementation, the full suite of scorers is offered in this paper's companion Python toolkit, UQLM. To evaluate the performance of the various scorers, we conduct an extensive set of experiments using several LLM question-answering benchmarks. We find that our tunable ensemble typically surpasses its individual components and outperforms existing hallucination detection methods. Our results demonstrate the benefits of customized hallucination detection strategies for improving the accuracy and reliability of LLMs.

摘要

幻觉问题是大型语言模型(LLMs)长期存在的缺陷。随着这些模型在医疗保健和金融等高风险领域的广泛应用,建立有效的幻觉检测机制变得至关重要。为此,我们提出了一种零资源幻觉检测的通用框架,可供从业者应用于实际场景。我们通过改造多种现有不确定性量化(UQ)技术实现这一目标,包括黑盒UQ、白盒UQ和LLM-as-a-Judge等方法,将其转化为标准化的响应级置信度评分(0-1范围)。为增强灵活性,我们引入了一种可调集成方案,能够融合任意组合的个体置信度评分,使从业者能针对特定应用场景优化集成方案以获得更佳性能。为便于实施,本文配套Python工具包UQLM提供了完整的评分器套件。我们通过多个LLM问答基准测试开展了全面实验评估,发现可调集成方案通常优于其各组成组件,且表现超越现有幻觉检测方法。实验结果证明了定制化幻觉检测策略对于提升LLMs准确性与可靠性的显著优势。
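
The tunable ensemble described above combines individual confidence scores, each already normalized to the range 0 to 1, into a single response-level score. A weighted average is one simple way to do this. The sketch below is an assumption for illustration and does not reproduce the UQLM toolkit's API: the function name `ensemble_confidence`, the scorer names, and the weights are all hypothetical.

```python
def ensemble_confidence(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-scorer confidence scores in [0, 1].

    scores: scorer name -> confidence for one response.
    weights: scorer name -> non-negative weight (tuned per use case).
    """
    for name, score in scores.items():
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score for {name} must lie in [0, 1]")
        if weights[name] < 0:
            raise ValueError(f"weight for {name} must be non-negative")
    total = sum(weights[name] for name in scores)
    if total <= 0:
        raise ValueError("at least one scorer needs a positive weight")
    return sum(weights[name] * score for name, score in scores.items()) / total


# Hypothetical scores from three scorer families for one LLM response.
scores = {"black_box": 0.8, "white_box": 0.6, "llm_judge": 1.0}
weights = {"black_box": 1.0, "white_box": 1.0, "llm_judge": 2.0}
confidence = ensemble_confidence(scores, weights)
print(round(confidence, 2))  # approximately 0.85
```

Because the weights are normalized, the result stays in the range 0 to 1, and tuning for a specific use case reduces to searching over the weights against a labeled validation set.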


LLMs for Engineering: Teaching Models to Design High Powered Rockets

Abstract

arXiv:2504.19394v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) have transformed software engineering, but their application to physical engineering domains remains underexplored. This paper evaluates LLMs' capabilities in high-powered rocketry design through RocketBench, a benchmark connecting LLMs to high-fidelity rocket simulations. We test models on two increasingly complex design tasks: target altitude optimization and precision landing challenges. Our findings reveal that while state-of-the-art LLMs demonstrate strong baseline engineering knowledge, they struggle to iterate on their designs when given simulation results and ultimately plateau below human performance levels. However, when enhanced with reinforcement learning (RL), we show that a 7B parameter model outperforms both SoTA foundation models and human experts. This research demonstrates that RL-trained LLMs can serve as effective tools for complex engineering optimization, potentially transforming engineering domains beyond software development.

摘要

大语言模型(LLMs)已深刻改变了软件工程领域,但其在物理工程领域的应用仍待深入探索。本文通过RocketBench基准测试(一种将LLMs与高保真火箭模拟相连接的评估框架),系统评估了LLMs在高功率火箭设计中的能力。我们针对两项复杂度递增的设计任务进行测试:目标高度优化与精准着陆挑战。研究发现,尽管最先进的LLMs展现出扎实的基础工程知识,但在获得模拟反馈后难以有效迭代设计方案,最终性能停滞在人类水平之下。然而,当采用强化学习(RL)增强后,一个70亿参数的模型表现超越了现有基础模型和人类专家。本研究表明,经过RL训练的LLMs能成为复杂工程优化的有效工具,有望在软件开发之外的工程领域引发变革。